QMapper for Smart Grid: Migrating SQL-based Application to Hive Yue Wang, Yingzhong Xu, Yue Liu, Jian Chen and Songlin Hu SIGMOD’15, May 31–June 4, 2015.


Content
▫Introduction
▫System Overview
▫Query Rewriting
▫Cost Model
▫Implementation
▫Experiments

Introduction
▫High-level query languages built on MapReduce, such as Hive and Pig, are widely used
▫Traditional enterprises are running into performance bottlenecks in their current RDBMS-based infrastructure
▫Hive cannot fully support SQL syntax at the moment
▫Even when an SQL query written for an RDBMS is accepted by Hive directly, its performance on Hadoop may be very low

Contribution
▫Translate SQL into optimized HiveQL
▫A cost model is proposed to reflect the execution time of MapReduce jobs
▫An algorithm is designed to reorganize the join structure so as to construct a near-optimal query

SECICS System
The total amount of data is 20 TB, and about 30 GB of new data is added to the database every day.
Three kinds of data in SECICS:
▫Meter data: collected by smart meters
▫Archive data: records the detailed archived information of meter data
▫Statistic data: the result of offline batch analysis

Background

Low data write throughput
▫An RDBMS with complex indexes cannot provide enough write throughput
Unsatisfactory statistical analysis capability
▫The average processing time even reaches 3 to 4 hours
Weak scalability
▫Scaling out an RDBMS mostly leads to a redesign of the sharding strategy as well as a lot of application logic
Uncontrollable resource competition

The migration of Stored Procedures

Overview

Four Components
SQL Interpreter:
▫resolves the SQL query provided by a user and parses it into an Abstract Syntax Tree (AST)
Query Rewriter:
▫a Rule-Based Rewriter (RBR) checks whether a query matches a series of static rules; when it does, new equivalent queries are generated
▫a Cost-Based Optimizer (CBO) is used to further optimize the join structure of each query

Four Components
Statistics Collector:
▫collects statistics of related tables and their columns
Plan Evaluator:
▫receives the queries with equivalent join cost generated by the RBR

QUERY REWRITING
Rule-based Rewriter
▫detects the SQL clauses that are not well supported by Hive and transforms them into HiveQL
▫initial rules are invoked first to check whether the query can be rewritten
▫the RBR then traverses the subqueries of each query and applies the rules to them recursively
▫all rewritten queries are generated and sent to the CBO
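The traversal described above can be sketched as a small recursive function. This is a toy illustration, not QMapper's implementation: the `Query` class, the two rule functions, and the node names such as `INSERT_OVERWRITE_SELECT` are hypothetical stand-ins for real AST nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Query:
    """Hypothetical AST node: a statement kind plus nested subqueries."""
    kind: str
    subqueries: List["Query"] = field(default_factory=list)


# A rule returns a rewritten node, or None when it does not match.
Rule = Callable[[Query], Optional[Query]]


def update_rule(q: Query) -> Optional[Query]:
    # Assumed stand-in for the basic UPDATE rule.
    if q.kind == "UPDATE":
        return Query("INSERT_OVERWRITE_SELECT", q.subqueries)
    return None


def exists_rule(q: Query) -> Optional[Query]:
    # Assumed stand-in for the (NOT) EXISTS rule.
    if q.kind in ("EXISTS", "NOT_EXISTS"):
        return Query("LEFT_OUTER_JOIN", q.subqueries)
    return None


RULES: List[Rule] = [update_rule, exists_rule]


def rewrite(q: Query, rules: List[Rule]) -> Query:
    """Apply the rules to the query itself, then recurse into its subqueries."""
    for rule in rules:
        result = rule(q)
        if result is not None:
            q = result
    return Query(q.kind, [rewrite(s, rules) for s in q.subqueries])


tree = Query("UPDATE", [Query("NOT_EXISTS", [Query("SELECT")])])
rewritten = rewrite(tree, RULES)
```

A real rewriter would emit every equivalent candidate for the CBO; this sketch keeps a single rewritten tree to show only the recursion.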

Example
lvRate(uid, deviceid, isMissing, date, type)
dataProfile(dataid, uid, isActive)
dataRecord(dataid, date, consumption)
powerCut(uid, date)
gprsUsage(deviceid, dataid, date, gprs)
deviceInfo(deviceid, region, type)

Basic UPDATE Rule
This rule translates an UPDATE into a SELECT statement by moving the simpleCondition into the selectList.
UPDATE lvRate a SET a.isMissing=true
  LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
  LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid AND a.date=c.date
  WHERE c.dataid IS NULL
INSERT OVERWRITE TABLE lvRate
  SELECT a.uid,a.deviceid,IF(c.dataid IS NULL,true,false) as isMissing,a.date,a.type
  FROM lvRate a
  LEFT OUTER JOIN dataProfile b ON a.uid=b.uid
  LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid AND a.date=c.date
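As a rough illustration of how the condition moves from WHERE into the select list, the sketch below assembles the HiveQL text from its parts. The function `rewrite_update` and all of its parameters are hypothetical; in particular, the value used when the condition is false (`else_value`) is passed in explicitly because the slide's example uses `false`.

```python
def rewrite_update(table, alias, columns, set_column, new_value,
                   else_value, condition, joins):
    """Build an INSERT OVERWRITE ... SELECT that emulates an UPDATE.

    The updated column becomes IF(condition, new_value, else_value);
    every other column is copied through unchanged.
    """
    select_list = []
    for col in columns:
        if col == set_column:
            select_list.append(
                f"IF({condition},{new_value},{else_value}) as {col}")
        else:
            select_list.append(f"{alias}.{col}")
    lines = [
        f"INSERT OVERWRITE TABLE {table}",
        "SELECT " + ",".join(select_list),
        f"FROM {table} {alias}",
    ] + list(joins)
    return "\n".join(lines)


hiveql = rewrite_update(
    table="lvRate",
    alias="a",
    columns=["uid", "deviceid", "isMissing", "date", "type"],
    set_column="isMissing",
    new_value="true",
    else_value="false",
    condition="c.dataid IS NULL",
    joins=[
        "LEFT OUTER JOIN dataProfile b ON a.uid=b.uid",
        "LEFT OUTER JOIN dataRecord c ON b.dataid=c.dataid AND a.date=c.date",
    ],
)
print(hiveql)
```

String assembly is used here only to keep the example short; the source describes the rewriter as working on a parsed syntax tree.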

(NOT) EXISTS Rule
This rule transforms the subquery into a LEFT OUTER JOIN and replaces the (NOT) EXISTS condition with joinColumn IS (NOT) NULL.
DELETE FROM lvRate a
  WHERE NOT EXISTS (
    SELECT 1 FROM powerCut b
    WHERE a.uid=b.uid AND a.date=b.date )
INSERT OVERWRITE TABLE lvRate
  SELECT a.uid,a.deviceid,a.isMissing,a.date,a.type
  FROM lvRate a
  LEFT OUTER JOIN ( SELECT uid,date FROM powerCut ) b
    ON a.uid=b.uid AND a.date=b.date
  WHERE b.uid IS NULL
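The equivalence this rule relies on, that a NOT EXISTS filter selects the same rows as a LEFT OUTER JOIN followed by an IS NULL test on the join column, can be checked on a small in-memory database. The sketch uses Python's `sqlite3` module rather than Hive, and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lvRate(uid INTEGER, deviceid INTEGER, isMissing INTEGER,
                    date TEXT, type TEXT);
CREATE TABLE powerCut(uid INTEGER, date TEXT);
-- Made-up sample rows.
INSERT INTO lvRate VALUES
  (1, 10, 0, '2014-01-01', 'A'),
  (2, 20, 0, '2014-01-01', 'B'),
  (3, 30, 1, '2014-01-02', 'A');
INSERT INTO powerCut VALUES
  (2, '2014-01-01'),   -- matches uid 2 on the same date
  (3, '2014-01-01');   -- same uid as row 3 but a different date
""")

# Filter expressed with a correlated NOT EXISTS subquery.
not_exists_rows = conn.execute("""
SELECT a.uid FROM lvRate a
WHERE NOT EXISTS (SELECT 1 FROM powerCut b
                  WHERE a.uid = b.uid AND a.date = b.date)
ORDER BY a.uid
""").fetchall()

# Same filter expressed as LEFT OUTER JOIN plus IS NULL.
left_join_rows = conn.execute("""
SELECT a.uid FROM lvRate a
LEFT OUTER JOIN (SELECT uid, date FROM powerCut) b
  ON a.uid = b.uid AND a.date = b.date
WHERE b.uid IS NULL
ORDER BY a.uid
""").fetchall()

print(not_exists_rows, left_join_rows)
```

Both queries return the rows for uid 1 and uid 3, the two readings that have no power cut recorded for the same user and date.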

Cost-based Optimizer
SELECT sum(gprs), type
  FROM gprsUsage A
  JOIN deviceInfo B ON A.deviceid = B.deviceid
  JOIN dataRecord C ON A.dataid = C.dataid AND A.date = C.date
  JOIN dataProfile D ON C.dataid = D.dataid
  LEFT OUTER JOIN powerCut E ON D.uid = E.uid AND A.date = E.date
  WHERE E.uid IS NULL AND A.date='2014-01-01'
  GROUP BY B.type
SELECT sum(gprs), type FROM(
  SELECT T1.gprs, T1.date, T1.type, T2.uid FROM
    (SELECT A.gprs, A.dataid, A.date, B.type
       FROM gprsUsage A JOIN deviceInfo B ON A.deviceid = B.deviceid
       WHERE A.date='2014-01-01' )T1
    JOIN
    (SELECT C.dataid, C.date, D.uid
       FROM dataRecord C JOIN dataProfile D ON C.dataid = D.dataid)T2
    ON T1.dataid = T2.dataid

Cost-based Optimizer

▫Unlike traditional databases, MapReduce-based query processing writes the intermediate result of each join back to HDFS, and the next join operation reads it from HDFS again, causing high I/O costs
▫The main difference in intermediate results is that the left-deep plan generates A ⋈ B ⋈ C while the bushy plan generates C ⋈ D
▫A bushy plan may still perform worse, as its jobs will compete for computing resources
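To make the difference in intermediate results concrete, the following sketch enumerates the join results that each plan shape has to materialize for the four inner-joined tables A to D. Plans are written as nested tuples, and the row counts in `hypothetical_rows` are invented purely to show how an I/O estimate could be compared; they are not measurements from the source.

```python
def tables(node):
    """Set of base tables under a plan node (a name or a (left, right) pair)."""
    if isinstance(node, str):
        return frozenset([node])
    left, right = node
    return tables(left) | tables(right)


def intermediates(plan):
    """Table sets of every join below the root, i.e. results written to HDFS."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return
        left, right = node
        walk(left, False)
        walk(right, False)
        if not is_root:
            found.append(tables(node))

    walk(plan, True)
    return found


left_deep = ((("A", "B"), "C"), "D")
bushy = (("A", "B"), ("C", "D"))

left_only = set(intermediates(left_deep)) - set(intermediates(bushy))
bushy_only = set(intermediates(bushy)) - set(intermediates(left_deep))

# Invented cardinalities, used only to illustrate the comparison.
hypothetical_rows = {
    frozenset("AB"): 1000,
    frozenset("ABC"): 800,
    frozenset("CD"): 5000,
}


def hdfs_io(plan):
    """Each intermediate result is written once and read once."""
    return sum(2 * hypothetical_rows[t] for t in intermediates(plan))


print(sorted(map(sorted, left_only)), sorted(map(sorted, bushy_only)))
print(hdfs_io(left_deep), hdfs_io(bushy))
```

Which shape wins depends entirely on the estimated sizes, which is why a cost model is needed rather than a fixed preference for one plan shape.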

COST MODEL
Cost of MapReduce
▫The Map phase can be divided into three subphases: Map, Spill and Merge
▫The Reduce phase also includes three parts: Shuffle, Merge and Reduce
Map
▫For each Mapper:

Mapper Cost Model
Spill
Merge
Unlike normal MapReduce jobs, in Hive the internal logic of mappers may vary depending on the specific table being processed.

Reduce
In the reduce phase, shuffle is responsible for fetching the mappers' outputs to their corresponding reducers.

Merge Reduce Total Cost

Cost of Operators in Map and Reduce
▫To calculate the costs, a few sample queries based on TPC-H are designed as probes to collect the execution time of operators
▫Given a chain with n operators, the cost is evaluated as:

Cost of Workflow
A HiveQL query is finally compiled into a MapReduce workflow (a directed acyclic graph) in which each node is a single MapReduce job and each edge represents the dataflow between jobs.
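The transcript does not preserve how per-job costs are combined into a workflow cost, so the sketch below only illustrates two common ways to aggregate costs over such a DAG: summing all jobs (fully serial execution) and taking the longest dependency chain (assuming independent jobs run in parallel). The job names, costs and dependencies are hypothetical, and neither aggregation is claimed to be the one used by QMapper.

```python
from functools import lru_cache

# Hypothetical per-job cost estimates (seconds) and dataflow edges.
job_cost = {"j1": 40.0, "j2": 60.0, "j3": 30.0}
depends_on = {"j1": [], "j2": [], "j3": ["j1", "j2"]}


def serial_cost():
    """Cost if the jobs run one after another."""
    return sum(job_cost.values())


@lru_cache(maxsize=None)
def finish_time(job):
    """Earliest finish time of a job when independent jobs run in parallel."""
    start = max((finish_time(d) for d in depends_on[job]), default=0.0)
    return start + job_cost[job]


def critical_path_cost():
    """Cost of the longest dependency chain in the DAG."""
    return max(finish_time(j) for j in job_cost)


print(serial_cost(), critical_path_cost())
```

With these made-up numbers the serial estimate is 130 seconds and the critical-path estimate is 90 seconds; the gap between the two is the room that parallel job execution could exploit if resources allow.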

Experiments
▫Evaluate the correctness and efficiency of QMapper: the efficiency of translating SQL into HiveQL and the efficiency of HiveQL execution
▫Compare QMapper with manually translated work
▫TPC-H demonstrates the execution efficiency of the HiveQL generated by QMapper
▫The Smart Grid application shows the correctness and translation efficiency of QMapper

Join Performance

Scalability

Accuracy of Cost Model
