A Study of Data Partitioning on OpenCL-based FPGAs
Zeke Wang (NTU Singapore), Bingsheng He (NTU Singapore), Wei Zhang (HKUST)

Outline
Background and Motivations
– Data Partitioning on FPGA
– OpenCL on FPGA
Design
Experiment
Conclusion

What is Data Partitioning?
Data partitioning divides the input table (with tuples) into a number of partitions according to an input partitioning function.
– It splits the input table into cache-conscious, non-overlapping sub-problems.
– It is a building block in many database applications (e.g., hash join and aggregation).
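The definition above can be illustrated with a minimal Python sketch. It is not the FPGA implementation from the talk: the tuple layout, the `partition` helper, and the modulo partitioning function are assumptions chosen only to show the sequential read and the scattered write.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]  # (key, payload); illustrative layout


def partition(table: List[Row], num_partitions: int,
              part_fn: Callable[[int], int]) -> List[List[Row]]:
    """Split the table into num_partitions non-overlapping lists of tuples."""
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for key, payload in table:                 # sequential read of the input
        p = part_fn(key) % num_partitions      # apply the partitioning function
        partitions[p].append((key, payload))   # scattered (random) write
    return partitions


# Hypothetical input: 8 tuples, partitioned 4 ways by key modulo 4.
table = [(k, k * 10) for k in range(8)]
parts = partition(table, 4, lambda key: key)
```

Every input tuple lands in exactly one partition, which is what makes the resulting sub-problems non-overlapping.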

What is Data Partitioning?
[Figure: the input table is read sequentially (sequential memory read), each tuple passes through the partitioning function, and tuples are written to partitions 0, 1, 2, ..., P (random memory write).]
Partitioning is a memory-intensive operation.

Benchmarking Memory Subsystem
1, Sequential bandwidth > random bandwidth.
2, Random memory access is more sensitive to the data access unit size. Use Long8, not byte.
[Figure labels: Linear, Sub-linear.]

Outline
Background and Motivations
– Data Partitioning on FPGA
– OpenCL on FPGA
Design
Experiment
Conclusion

What is OpenCL?
OpenCL has been developed for heterogeneous computing environments, e.g., CPU+GPU/FPGA, with a host-accelerator model of program execution.
– The host processor (CPU) runs control-intensive tasks.
– The accelerator (GPU/FPGA) runs computation-intensive code (i.e., kernels).

OpenCL on FPGA
– Global memory: external DDR.
– Local memory: on-chip memory blocks.
– Private memory: logic blocks.

OmniDB on FPGA
OmniDB [1]: state-of-the-art OpenCL-based query processor on CPU/GPU.
Directly mapping OmniDB to FPGA has a serious problem:
– Lock overhead.
[1] Shuhao Zhang et al. OmniDB: Towards Portable and Efficient Query Processing on Parallel CPU/GPU Architectures, VLDB’13.

Why is a Lock Required?
[Figure: 4 work items pass tuples from the input table through the partitioning function; work items that write to the same partition conflict.]
Consistency: one lock for each partition.

Existing OpenCL Implementations
– Global lock: multiple kernels, high latency.
– Local lock: one kernel, low latency.

Target for Data Partitioning
Multiple kernels and low latency require great effort.
We need help from a new OpenCL feature (channel).

Impact of Channel
Kernel == Verilog module
Channel == FIFO

Outline
Background and Motivations
Design
Experiment
Conclusion

Our Proposal
Multi-kernel partitioning with channels is presented to attack the lock overhead.
On-chip buffers are used to efficiently utilize the memory subsystem on OpenCL-based FPGAs.

Multi-kernel Partitioning
Multiple kernels execute concurrently in a producer-consumer manner.

Data_in Kernel
1, Load W tuples from DDR to tuples[W].
2, For (i ← 0 to W−1) do
3,   Compute the index j of the consumer kernel for tuples[i].
4,   Write tuples[i] to consumer kernel j via a channel.
Consumer kernel: Data_out kernel or Skewed_handling kernel.
Throughput: one cycle for one tuple. 1/W memory transactions per tuple.
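The Data_in steps can be mimicked in software as a rough sketch. This is Python rather than OpenCL, a `deque` stands in for each channel, and the routing rule `key % len(channels)` is a hypothetical placeholder: the slides do not say how the consumer index j is computed or how skewed tuples are detected.

```python
from collections import deque
from typing import Deque, List, Tuple

Row = Tuple[int, int]  # (key, payload); illustrative layout


def data_in(table: List[Row], channels: List[Deque[Row]], w: int) -> int:
    """Route tuples to consumer channels; return the number of burst reads."""
    bursts = 0
    for start in range(0, len(table), w):
        tuples = table[start:start + w]         # step 1: one wide read of up to W tuples
        bursts += 1
        for key, payload in tuples:             # step 2: one tuple per iteration
            j = key % len(channels)             # step 3: hypothetical routing rule
            channels[j].append((key, payload))  # step 4: write to channel j
    return bursts


# Two consumer channels, 16 tuples, W = 8.
channels: List[Deque[Row]] = [deque() for _ in range(2)]
bursts = data_in([(k, 0) for k in range(16)], channels, w=8)
```

With 16 tuples and W = 8 the sketch issues two wide reads, which matches the 1/W memory transactions per tuple noted on the slide.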

Data_out Kernel
1, Read a tuple from the channel.
2, Compute the partition index of the tuple.
3, Update the counter (local) of the partition.
4, Store the tuple to the on-chip bucket.
5, If (bucket has S tuples) then store the whole bucket to global memory.
Critical path: 7 cycles. Throughput: seven cycles for one tuple. 1/S memory transactions per tuple.
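The bucket-then-flush behaviour of Data_out can be sketched as follows. This is a software model under stated assumptions, not the kernel itself: the `DataOut` class, its field names, and the modulo partitioning function are hypothetical, and tuples left in a partially filled bucket at the end of the input are simply not flushed here, since the slides do not describe that case.

```python
from typing import List, Tuple

Row = Tuple[int, int]  # (key, payload); illustrative layout


class DataOut:
    """Software model of a consumer that buffers S tuples per partition."""

    def __init__(self, num_partitions: int, s: int) -> None:
        self.s = s
        self.num_partitions = num_partitions
        self.counters = [0] * num_partitions  # per-partition tuple counters
        self.buckets: List[List[Row]] = [[] for _ in range(num_partitions)]
        self.global_memory: List[List[Row]] = [[] for _ in range(num_partitions)]
        self.transactions = 0                 # number of whole-bucket stores

    def consume(self, row: Row) -> None:
        p = row[0] % self.num_partitions      # step 2: hypothetical partitioning function
        self.counters[p] += 1                 # step 3: update the partition counter
        self.buckets[p].append(row)           # step 4: store into the on-chip bucket
        if len(self.buckets[p]) == self.s:    # step 5: bucket full, store it as a whole
            self.global_memory[p].extend(self.buckets[p])
            self.buckets[p] = []
            self.transactions += 1


# 32 tuples, 4 partitions, S = 4 slots per bucket.
out = DataOut(num_partitions=4, s=4)
for k in range(32):
    out.consume((k, 0))
```

Here 32 tuples produce 8 bucket stores, i.e. one global memory transaction per S = 4 tuples.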

Skewed_handling Kernel
1, Read the tuple from the channel.
2, Update the counter (private) of the skewed partition.
3, Store the tuple to the on-chip bucket.
4, If (bucket has S tuples) then store the whole bucket to global memory.
Critical path: 1 cycle. Throughput: one cycle for one tuple. 1/S global memory transactions per tuple.

Cost Model
Given the limited FPGA resources, the cost model chooses the optimal configuration for the following parameters:
– S: number of slots in the on-chip bucket for each partition.
– DO: number of Data_out kernels that concurrently handle the input tuples.
The ranges of S and DO are small, so we consider all possible combinations.
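The exhaustive search over S and DO can be sketched in Python. The slides do not give the cost model's formulas, so both `estimated_cycles_per_tuple` and `fits` below are toy stand-ins rather than the authors' model. The only values taken from the slides are the per-tuple throughputs (one cycle for Data_in, seven cycles for each Data_out kernel) and the 1/S transaction rate; the transaction cost `txn_cycles`, the memory budget, and the assumption that every Data_out kernel holds a bucket for every partition are hypothetical.

```python
from itertools import product
from typing import Iterable, Optional, Tuple


def estimated_cycles_per_tuple(s: int, do: int, txn_cycles: float) -> float:
    """Toy estimate: the slower of the compute side and the memory side."""
    compute = max(1.0, 7.0 / do)  # Data_in: 1 cycle/tuple; each Data_out: 7 cycles/tuple
    memory = txn_cycles / s       # 1/S transactions per tuple, hypothetical cost each
    return max(compute, memory)


def fits(s: int, do: int, partitions: int, tuple_bytes: int,
         budget_bytes: int) -> bool:
    """Hypothetical resource check: every Data_out kernel buffers S tuples per partition."""
    return s * do * partitions * tuple_bytes <= budget_bytes


def best_config(s_values: Iterable[int], do_values: Iterable[int],
                partitions: int, tuple_bytes: int, budget_bytes: int,
                txn_cycles: float) -> Optional[Tuple[int, int]]:
    """Try every (S, DO) pair and keep the feasible one with the lowest estimate."""
    feasible = [(s, do) for s, do in product(s_values, do_values)
                if fits(s, do, partitions, tuple_bytes, budget_bytes)]
    if not feasible:
        return None
    return min(feasible,
               key=lambda c: (estimated_cycles_per_tuple(c[0], c[1], txn_cycles),
                              c[0] * c[1]))


# Hypothetical numbers: 64 partitions, 8-byte tuples, 32 KiB of on-chip buffer.
best = best_config([1, 2, 4, 8, 16], [1, 2, 4, 8], partitions=64,
                   tuple_bytes=8, budget_bytes=32768, txn_cycles=16.0)
```

Because the ranges of S and DO are small, enumerating every pair is cheap, which is the reason the slide gives for considering all combinations.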

Experimental Setup
Platform:
– Terasic’s DE5-Net board: Altera Stratix V A7 and 4GB 2-bank DDR3.
– Altera OpenCL SDK version 14.0.
Data sets:
– Each tuple consists of a key and a payload; both keys and payloads are 4 bytes.
– The probability of individual keys follows a Zipf distribution, with the Zipf factor ranging from 0 to 1.75.

Evaluation of Cost Model
[Figure: performance for different numbers (S) of slots, with DO=8; labels: Memory, Lock.]
Our cost model can capture the performance trend with different numbers (S) of slots.

Impact of Skewed_handling Kernel
[Figure annotation: 3.1X.]
Our cost model can capture the performance trend with different numbers (S) of slots.

Impact of Number of Partitions
[Figure annotation: more stable.]

Impact of Size of Tuples
[Figure annotations: 10.7X; good scalability.]

Conclusion
We demonstrate the significant overheads of locks and memory accesses in data partitioning on FPGAs.
We develop a new multi-kernel partitioning approach, together with on-chip buckets, to address those overheads.
Our proposed approach can achieve a 10.7X speedup over the existing OpenCL implementation.

Q & A
Our Terasic DE5-Net FPGA board was donated by the Altera University Program.
This work is supported in part by a MoE AcRF Tier 2 grant (MOE2012-T2-1-126) in Singapore.
Our research group: Xtra Computing Group, http://pdcc.ntu.edu.sg/xtra/
