QMapper for Smart Grid: Migrating SQL-based Application to Hive Yue Wang, Yingzhong Xu, Yue Liu, Jian Chen and Songlin Hu SIGMOD’15, May 31–June 4, 2015.

Presentation transcript:

Content
Introduction
System Overview
Query Rewriting
Cost Model
Implementation
Experiments

Introduction
High-level query languages based on MapReduce, such as Hive and Pig, are widely used
Performance bottlenecks of the current RDBMS-based infrastructure are appearing in traditional enterprises
Hive does not yet fully support SQL syntax
Even when an SQL query written for an RDBMS is accepted directly by Hive, it may perform very poorly on Hadoop

Contributions
Translate SQL into optimized HiveQL
A cost model is proposed to reflect the execution time of MapReduce jobs
An algorithm is designed to reorganize the join structure so as to construct a near-optimal query

SECICS System
The total amount of data is 20 TB, and about 30 GB of new data is added to the database every day
There are three kinds of data in SECICS:
▫ Meter data: collected by smart meters
▫ Archive data: records the detailed archived information of meter data
▫ Statistical data: the result of offline batch analysis

Background

Low data write throughput
▫ An RDBMS with complex indexes cannot provide enough write throughput
Unsatisfactory statistical analysis capability
▫ The average processing time reaches 3 to 4 hours
Weak scalability
▫ Scaling out an RDBMS usually requires redesigning the sharding strategy as well as much of the application logic
Uncontrollable resource competition

The migration of Stored Procedures

Overview

Four Components
SQL Interpreter:
▫ resolves the SQL query provided by a user and parses it into an Abstract Syntax Tree
Query Rewriter:
▫ a Rule-Based Rewriter (RBR) checks whether a query matches a series of static rules; if it does, new equivalent queries are generated
▫ a Cost-Based Optimizer (CBO) then further optimizes the join structure of each query

Four Components
Statistics Collector
▫ collects statistics on the related tables and their columns
Plan Evaluator
▫ queries generated by the RBR that have equivalent join cost are sent to it

QUERY REWRITING
Rule-based Rewriter
▫ detects the SQL clauses that are not well supported by Hive and transforms them into HiveQL
▫ initial rules are invoked first to check whether the query can be rewritten
▫ the RBR traverses the subqueries of each query and applies rules to them recursively
▫ all rewritten queries are generated and sent to the CBO
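The recursive traversal described above can be sketched as follows. This is a minimal illustration, not QMapper's implementation: the `Query` node, the two toy rules, and the string labels they produce are all hypothetical, and a real rewriter would operate on a full Abstract Syntax Tree.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Query:
    """Hypothetical toy query node: a construct label plus nested subqueries."""
    kind: str
    subqueries: Tuple["Query", ...] = ()


# A rule returns the rewritten node, or None when it does not match.
Rule = Callable[[Query], Optional[Query]]


def update_rule(q: Query) -> Optional[Query]:
    # Stand-in for the UPDATE rule: UPDATE becomes INSERT OVERWRITE ... SELECT.
    if q.kind == "UPDATE":
        return replace(q, kind="INSERT OVERWRITE SELECT")
    return None


def not_exists_rule(q: Query) -> Optional[Query]:
    # Stand-in for the (NOT) EXISTS rule: the subquery becomes a LEFT OUTER JOIN.
    if q.kind == "NOT EXISTS":
        return replace(q, kind="LEFT OUTER JOIN")
    return None


RULES = (update_rule, not_exists_rule)


def rewrite(q: Query, rules: Tuple[Rule, ...] = RULES) -> Query:
    """Apply every matching rule to the node, then recurse into its subqueries."""
    for rule in rules:
        rewritten = rule(q)
        if rewritten is not None:
            q = rewritten
    return replace(q, subqueries=tuple(rewrite(s, rules) for s in q.subqueries))


original = Query("UPDATE", (Query("NOT EXISTS"),))
result = rewrite(original)
```

Keeping each rule as a separate function that either rewrites or declines makes it easy to add rules without touching the traversal.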

Example
lvRate(uid,deviceid,isMissing,date,type)
dataProfile(dataid,uid,isActive)
dataRecord(dataid,date,consumption)
powerCut(uid,date)
gprsUsage(deviceid,dataid,date,gprs)
deviceInfo(deviceid,region,type)

Basic UPDATE Rule
This rule translates an UPDATE into a SELECT statement by moving the simpleCondition into the selectList

UPDATE lvRate a SET a.isMissing=true
LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid AND a.date=c.date
WHERE c.dataid IS NULL

INSERT OVERWRITE TABLE lvRate
SELECT a.uid,a.deviceid,IF(c.dataid IS NULL,true,false) as isMissing,a.date,a.type
FROM lvRate a
LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid AND a.date=c.date

(NOT) EXISTS Rule
This rule transforms the subquery into a LEFT OUTER JOIN and replaces the (NOT) EXISTS condition with joinColumn IS (NOT) NULL

DELETE FROM lvRate a
WHERE NOT EXISTS (SELECT 1 FROM powerCut b WHERE a.uid=b.uid AND a.date=b.date )

INSERT OVERWRITE TABLE lvRate
SELECT a.uid,a.deviceid,a.isMissing,a.date,a.type
FROM lvRate a
LEFT OUTER JOIN ( SELECT uid,date FROM powerCut) b ON a.uid=b.uid AND a.date=b.date
WHERE b.uid IS NULL
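The equivalence this rule relies on, NOT EXISTS versus LEFT OUTER JOIN with an IS NULL test on the join column, can be checked on a small scale. The sketch below runs on Python's sqlite3 rather than Hive (an assumption made so the example is runnable), uses invented sample rows, and compares the two forms at the SELECT level only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lvRate(uid INTEGER, deviceid INTEGER, isMissing INTEGER,
                        date TEXT, type TEXT);
    CREATE TABLE powerCut(uid INTEGER, date TEXT);
    -- Invented sample rows, for illustration only.
    INSERT INTO lvRate VALUES
        (1, 10, 0, '2015-01-01', 'a'),
        (2, 20, 0, '2015-01-01', 'b'),
        (3, 30, 0, '2015-01-02', 'a');
    INSERT INTO powerCut VALUES (1, '2015-01-01'), (3, '2015-01-01');
    """
)

# Form that uses a correlated NOT EXISTS subquery.
not_exists_rows = conn.execute(
    """
    SELECT a.uid FROM lvRate a
    WHERE NOT EXISTS (SELECT 1 FROM powerCut b
                      WHERE a.uid = b.uid AND a.date = b.date)
    ORDER BY a.uid
    """
).fetchall()

# Rewritten form: LEFT OUTER JOIN plus an IS NULL test on the join column.
left_join_rows = conn.execute(
    """
    SELECT a.uid FROM lvRate a
    LEFT OUTER JOIN (SELECT uid, date FROM powerCut) b
        ON a.uid = b.uid AND a.date = b.date
    WHERE b.uid IS NULL
    ORDER BY a.uid
    """
).fetchall()

conn.close()
```

Only uid 1 has a power cut recorded on the same date, so both forms return the rows for uid 2 and uid 3.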

Cost-based Optimizer
SELECT sum(gprs), type
FROM gprsUsage A
JOIN deviceInfo B ON A.deviceid = B.deviceid
JOIN dataRecord C ON A.dataid = C.dataid AND A.date = C.date
JOIN dataProfile D ON C.dataid = D.dataid
LEFT OUTER JOIN powerCut E ON D.uid = E.uid AND A.date = E.date
WHERE E.uid IS NULL AND A.date=’ ’
GROUP BY B.type

SELECT sum(gprs), type
FROM(
  SELECT T1.gprs, T1.date, T1.type, T2.uid
  FROM (SELECT A.gprs, A.dataid, A.date, B.type
        FROM gprsUsage A JOIN deviceInfo B ON A.deviceid = B.deviceid
        WHERE A.date=’ ’ )T1
  JOIN (SELECT C.dataid, C.date, D.uid
        FROM dataRecord C JOIN dataProfile D ON C.dataid = D.dataid)T2
  ON T1.dataid = T2.dataid

Cost-based Optimizer

Unlike traditional databases, MapReduce-based query processing writes the intermediate result of each join back to HDFS, and the next join reads it from HDFS again, causing large I/O costs
The main difference in intermediate results is that the left-deep plan generates A ⋈ B ⋈ C and the bushy plan generates C ⋈ D
The bushy plan may have worse performance, as its jobs will compete for computing resources
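The difference between the two plan shapes can be made concrete by listing which intermediate join results each plan would materialize. The sketch below is illustrative only: it encodes plans as nested tuples (a hypothetical representation), treats every non-root join as one intermediate result written to HDFS, and does not estimate sizes.

```python
def tables(plan):
    """Return the base tables under a plan node, left to right."""
    if isinstance(plan, str):
        return (plan,)
    left, right = plan
    return tables(left) + tables(right)


def intermediates(plan, is_root=True):
    """List the table sets of all non-root joins, i.e. results that are
    written out and read back by a later join."""
    if isinstance(plan, str):
        return []
    left, right = plan
    found = intermediates(left, False) + intermediates(right, False)
    if not is_root:
        found.append(tables(plan))
    return found


# Join order for tables A, B, C, D from the example query (E omitted).
left_deep = ((("A", "B"), "C"), "D")
bushy = (("A", "B"), ("C", "D"))

left_deep_results = intermediates(left_deep)
bushy_results = intermediates(bushy)

only_left_deep = set(left_deep_results) - set(bushy_results)
only_bushy = set(bushy_results) - set(left_deep_results)
```

Both plans materialize A ⋈ B; they differ in that the left-deep plan also materializes A ⋈ B ⋈ C while the bushy plan materializes C ⋈ D, which can run alongside A ⋈ B.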

COST MODEL
Cost of MapReduce
▫ The Map phase can be divided into three subphases: Map, Spill and Merge
▫ The Reduce phase also includes three parts: Shuffle, Merge and Reduce
Map
▫ For each Mapper:

Mapper Cost Model
Spill
Merge
Unlike normal MapReduce jobs, in Hive the internal logic of mappers may vary depending on the specific table being processed.

Reduce
In the reduce phase, shuffle is responsible for fetching mapper outputs to their corresponding reducers

Merge
Reduce
Total Cost

Cost of Operators in Map and Reduce
In order to calculate the costs, a few sample queries based on TPC-H are designed as probes to collect the execution time of operators
Given a chain with n operators, the cost is evaluated as:

Cost of Workflow
A HiveQL query is finally compiled into a MapReduce workflow (a directed acyclic graph) in which each node is a single MapReduce job and each edge represents dataflow
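The workflow formula itself is not preserved in this transcript, so the following is not QMapper's cost function. It is a minimal sketch, under stated assumptions, of two simple bounds on the cost of such a DAG: the sum of all job costs (jobs run one after another) and the critical path (independent jobs run fully in parallel). The job names, costs and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical per-job costs (seconds) and dependencies, for illustration only.
job_cost = {"J1": 40.0, "J2": 30.0, "J3": 20.0}
# Each job maps to the set of jobs whose output it reads: J3 reads J1 and J2.
deps = {"J1": set(), "J2": set(), "J3": {"J1", "J2"}}


def serial_cost(job_cost):
    """Upper bound: every job runs alone, one after another."""
    return sum(job_cost.values())


def critical_path_cost(job_cost, deps):
    """Lower bound: a job starts as soon as all of its inputs are ready."""
    finish = {}
    # static_order() yields each job after all of its predecessors.
    for job in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps[job]), default=0.0)
        finish[job] = start + job_cost[job]
    return max(finish.values())


serial = serial_cost(job_cost)
parallel = critical_path_cost(job_cost, deps)
```

When parallel jobs compete for cluster resources, the actual time would be expected to fall somewhere between these two values.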

Experiments
Evaluate the correctness and efficiency of QMapper
▫ the efficiency of translating SQL into HiveQL and the efficiency of HiveQL execution
▫ comparison of QMapper with manually translated work
TPC-H demonstrates the execution efficiency of the HiveQL generated by QMapper
The Smart Grid application shows the correctness and translation efficiency of QMapper

Join Performance

Scalability

Accuracy of Cost Model