Energy-Efficient Query Processing on Embedded CPU-GPU Architectures
Xuntao Cheng, Bingsheng He, Chiew Tong Lau
Nanyang Technological University, Singapore.

Outline
– Motivations
– System Design
– Evaluations
– Conclusion

Query Processing in the Era of IoT

IoT devices collect large amounts of information and store it in databases such as SQLite, MySQL, and SQL Server Express. These lightweight databases evaluate user queries issued by many mobile applications. The energy consumption of query processing matters on such battery-powered systems.

Embedded GPUs are Emerging

Embedded GPUs have been incorporated into new-generation embedded devices.
– They offer higher performance for video/image processing than embedded CPUs.
– However, they consume more power than embedded CPUs.

The Embedded CPU-GPU Architecture

              CARMA                                     GPU Workstation
CPU           ARM Cortex-A9 (quad-core, 1.3GHz, ~9W)    Intel Xeon E (6-core, 2GHz, ~95W)
GPU           NVIDIA Quadro 1000M (96-core, ~45W)       NVIDIA Tesla K40C (2880-core, ~245W)
Memory        2GB                                       16GB
Storage       4GB eMMC                                  256GB SSD
PCI           4x PCIe Gen 1 (250MB/s)                   16x PCIe Gen 3 (985MB/s)
Idle power    ~10W                                      ~80W
Peak power    ~50W                                      ~400W

The power of the embedded GPU is 5 times higher than that of its CPU counterpart. The power of CARMA is 8 times lower than that of the workstation.

Our Questions

Is it more energy-efficient to use embedded GPUs for query processing?
– Challenge: the embedded GPU is more powerful, but it consumes more power.
Can we further improve energy efficiency by exploiting CPU-GPU co-processing on such embedded CPU-GPU architectures?
– Challenge: the PCIe bus is slow.

Methodology
1. Build a query processor on CARMA.
2. Consider three types of executions: CPU-only, GPU-only, and CPU-GPU co-processing.
3. Use two complementary approaches:
– Micro-benchmarks: individual query operators
– Macro-benchmarks: full queries (e.g., TPC-H)

[Figure: the input workload is split between the CPU cores (a 1-σ fraction) and the GPU cores (a σ fraction).]

Layered Design of the Query Engine

Layered design (adopted from GPUQP). Four common operators serve as micro-benchmarks. Scan and hash indexes require relatively little storage. CPU/GPU parallel algorithms are used for each primitive. Relations are stored in column stores.
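As an illustration of one of these operator primitives, a minimal, single-threaded build/probe hash join over the (key, record-ID) column layout used in the evaluation could be sketched as follows; this is a simplified stand-in, not the engine's actual parallel implementation, and the function name and tuple layout are illustrative:

```python
def hash_join(r, s):
    """Join two relations of (key, record_id) pairs on key.

    Build a hash table on R, then probe it with each tuple of S --
    the classic build/probe hash join.
    """
    build = {}
    for key, rid in r:
        build.setdefault(key, []).append(rid)  # build phase
    out = []
    for key, sid in s:
        for rid in build.get(key, ()):         # probe phase
            out.append((key, rid, sid))
    return out

# Tiny example with (32-bit key, record-ID) pairs.
R = [(1, 'r0'), (2, 'r1'), (2, 'r2')]
S = [(2, 's0'), (3, 's1')]
print(hash_join(R, S))  # [(2, 'r1', 's0'), (2, 'r2', 's0')]
```

The GPU versions of such primitives replace the build and probe loops with data-parallel kernels, which is where the re-optimization for CARMA comes in.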

Implementations

CPU
– We adopt code from related work (VLDB'13) and our recent implementation on multi-core CPUs (VLDB'15).
GPU
– We adopt code from our previous work (TODS'09 and VLDB'13).
– We re-optimized each CUDA kernel for CARMA.
CPU-GPU co-processing
– Inputs are partitioned and distributed between the CPU and the GPU according to a ratio σ.
– Final results are obtained by merging/concatenating the partial results from both processors.
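The σ-based co-processing scheme above can be sketched in a few lines; this is a minimal model assuming a simple contiguous split and sequential execution (in the real system the two partitions run concurrently on the CPU and GPU, and the operator and merge functions here are placeholders):

```python
def partition(tuples, sigma):
    """Split the input: a sigma fraction goes to the GPU, the rest to the CPU."""
    cut = int(len(tuples) * (1.0 - sigma))
    return tuples[:cut], tuples[cut:]   # (cpu_part, gpu_part)

def co_process(tuples, sigma, cpu_op, gpu_op, merge):
    """Run each partition through its processor's operator, then merge."""
    cpu_part, gpu_part = partition(tuples, sigma)
    # In the real engine these two calls execute concurrently.
    return merge(cpu_op(cpu_part), gpu_op(gpu_part))

# Example: a co-processed selection, merged by concatenation.
select = lambda part: [t for t in part if t % 2 == 0]
data = list(range(10))
print(co_process(data, 0.5, select, select, lambda a, b: a + b))
# [0, 2, 4, 6, 8]
```

Setting σ=0 or σ=1 recovers the CPU-only and GPU-only executions, which is how the three execution types fit into one framework.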

Experiences

Some state-of-the-art libraries cannot be easily cross-compiled.
– We had to manually edit the architecture-specific code and makefiles, and build them from scratch.
Peripherals of CARMA are not stable.
– The Ethernet connection occasionally slows down dramatically. The HDMI ports are nearly unusable.
We lack technical means for fine-grained energy measurements.
– We cannot measure the power of the CPU or the GPU separately. The power of the board is occasionally abnormal.

Evaluation Setup

Workloads
– Operators: selection, sum, sort, and hash join
– R and S relations: 50M tuples each; two attributes per tuple (32-bit key, 32-bit record-ID)
– Queries: TPC-H queries 9 and 14, scale factor 0.5 (size = 500 MB)
Metrics
– Execution time
– Energy consumption, measured with a Watts up? PRO power meter
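With an external meter sampling whole-board power at fixed intervals, the energy metric is simply power integrated over the execution time; a rectangle-rule accumulation is enough. The sampling interval and readings below are made-up values for illustration:

```python
def energy_joules(power_samples_w, interval_s):
    """Approximate energy as the sum of power samples (in watts)
    times the sampling interval (rectangle rule): E = sum(P_i) * dt."""
    return sum(power_samples_w) * interval_s

# E.g. ten hypothetical 1-second readings around 12 W -> about 120 J.
samples = [11.8, 12.1, 12.0, 12.2, 11.9, 12.0, 12.1, 11.9, 12.0, 12.0]
print(round(energy_joules(samples, 1.0), 1))  # 120.0
```

Because the meter sees only the whole board, idle power is included in every reading, which is one reason fine-grained per-processor attribution is not possible in this setup.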

Evaluations: Selection & Sum

The CPU-only approach delivers the best performance and the lowest energy consumption for selection and sum, both of which are memory-intensive.

[Figures: results for Selection and Sum.]

Evaluations: Sort & Hash Join

The GPU becomes more competitive when the workload is more computation-intensive. The fastest execution does not always guarantee the lowest energy consumption (e.g., when σ=0.5 for sort).

[Figures: results for Sort and Hash join.]

Evaluations: TPC-H Q9 and Q14

GPU-only outperforms CPU-only for the selected analytical queries. CPU-GPU co-processing achieves the best performance and the lowest energy consumption.

[Figures: results for Q9 and Q14.]

Comparison with a Workstation

CARMA is more energy-efficient only when the workload size is small. Partitioning of the input relations is needed for CARMA when the size exceeds 5 million tuples.

Conclusions

Embedded GPUs have become an integral component of embedded systems. Although the embedded GPU consumes more power, its higher computational capability and memory bandwidth are still beneficial for energy-efficient query processing.
– The embedded CPU is more energy-efficient when processing simple operators such as selection and sum.
– The embedded GPU outperforms the embedded CPU for computation-intensive operators such as sort and hash join, as well as for analytical queries.
CPU-GPU co-processing can further increase the energy efficiency of query processing on embedded devices.

Towards Energy-Proportional "Wimpy-Node" Clusters

Network
– A single UDT connection over Ethernet can only sustain 29.8 MB/s.
– With Ethernet jumbo frames enabled, this increases to 41.2 MB/s.
– By allocating two CPU cores to handle two connections in parallel, the bandwidth can be further increased to 83.4 MB/s.
Storage
– The current results are based on eMMC storage.
– Replacing the eMMC with an SSD increases the energy efficiency of the CPU-only and GPU-only approaches by 7% and 25%, respectively, for sort.

Acknowledgement

We thank NVIDIA for the hardware donation. This work is supported by the following grants and institutions.
– The National Research Foundation, Prime Minister's Office, Singapore, under its IDM Futures Funding Initiative, administered by the Interactive and Digital Media Programme Office (Grant No. MDA/IDM/2012/8/8-2 VOL 01).
– A MoE AcRF Tier 2 grant (MOE2012-T ) in Singapore.

Q & A

Thank you. Our research group: Xtra Computing Group.