Prediction-Based Multivariate Query Modeling Analytic Queries.

MapReduce is an important data-centric programming model. To ease its programmability, a set of data warehouse systems and query languages have been developed atop MapReduce. Hive and Pig are popular data warehouse systems. In 2009, more than 40% of Hadoop production jobs at Yahoo! were Pig programs. At Facebook, 95% of MapReduce jobs are generated by Hive.

Hive and Hadoop: Users submit queries in Hive SQL (HiveQL), a subset of SQL used in the unstructured world of Hadoop. In Hive, each SQL query is compiled and translated into a DAG (Directed Acyclic Graph) of MapReduce jobs with inter-dependencies.
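To make the DAG idea concrete, the following is a minimal sketch of how job dependencies constrain execution order. The job names and the `QUERY_DAG` dictionary are hypothetical, and the ordering routine is a generic topological sort (Kahn's algorithm), not code taken from Hive.

```python
from collections import deque


def topological_order(deps):
    """Return one valid execution order for a job DAG.

    `deps` maps each job name to the set of jobs it depends on.
    Every dependency must itself appear as a key in `deps`.
    """
    indegree = {job: len(parents) for job, parents in deps.items()}
    children = {job: [] for job in deps}
    for job, parents in deps.items():
        for parent in parents:
            children[parent].append(job)

    # Jobs without unfinished dependencies are ready to run.
    ready = deque(sorted(job for job, d in indegree.items() if d == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in sorted(children[job]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(deps):
        raise ValueError("dependency graph contains a cycle")
    return order


# Hypothetical query: two scans feed a join, which feeds a group-by.
QUERY_DAG = {
    "J1_scan_a": set(),
    "J2_scan_b": set(),
    "J3_join": {"J1_scan_a", "J2_scan_b"},
    "J4_groupby": {"J3_join"},
}

print(topological_order(QUERY_DAG))
```

In this sketch the join cannot start until both scans finish, which is exactly the kind of inter-job relationship a query carries.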

MapReduce adopts a job-level scheduling policy that strives for balanced distribution of tasks and effective utilization of resources. However, such a simplistic policy is unable to reconcile the dynamics of different jobs in complex analytic queries.

Loss of query semantics during job submission (the Hadoop side only sees individual jobs).
[Figure: HiveQL queries enter Hive, MR jobs are submitted to Hadoop MapReduce, and results are returned. Component labels: Parser, Semantic Analyzer, Planner, Optimizer, MapredWork Generator, Execution Engine, Hive Task Scheduler, JobTracker, Job Listener, Task Tracker. Job queue entries: J3(Q1), J2(Q1), J4(Q2). Legend: runnable job, un-submitted job, completed job.]

Semantic gap between Hive and Hadoop: Hadoop is unaware of the dependencies and inter-job relationships within a query and treats all jobs the same. Consequences: suboptimal query response efficiency and unfairness among queries.

To implement this, we add modules: semantics extraction (DAG, operator type, predicates, etc.), multivariate prediction (selectivity estimation), and two-level scheduling.
[Figure: system architecture. Labels: Hive, HiveQL Queries, Parser, Semantics Analyzer, Execution Engine, Job & Semantics, Results, Hadoop, JobListener, Semantics Extraction, Multivariate Prediction (Selectivity Estimation), Two-Level Scheduling, JobTracker, TaskTracker.]
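As an illustration of what two-level scheduling could look like, here is a minimal sketch under stated assumptions: the first level picks the query with the smallest predicted remaining time, and the second level picks a runnable job inside that query. The data layout, the function names, and the smallest-first policy are all hypothetical; the slides do not spell out the actual policy.

```python
def runnable_jobs(jobs):
    """Jobs that are still pending and whose dependencies are all done.

    `jobs` maps job id -> {"deps": set of job ids, "state": str},
    where state is "pending", "running", or "done".
    """
    done = {jid for jid, info in jobs.items() if info["state"] == "done"}
    return sorted(
        jid
        for jid, info in jobs.items()
        if info["state"] == "pending" and info["deps"] <= done
    )


def pick_next_job(queries, predicted_remaining):
    """Return (query_id, job_id) for the next job to launch, or None.

    Level 1 (assumed policy): visit queries in order of predicted
    remaining time, smallest first. Level 2: within the chosen query,
    take the first runnable job.
    """
    ordered = sorted(queries, key=lambda q: (predicted_remaining[q], q))
    for qid in ordered:
        candidates = runnable_jobs(queries[qid])
        if candidates:
            return qid, candidates[0]
    return None
```

A query whose runnable jobs are all already running is skipped, so the free slot goes to the next-smallest query instead of sitting idle.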

Multivariate Query Modeling: Dynamically allocate resources among workflows and prioritize latency-sensitive small queries. MapReduce jobs in Hive queries are categorized into three types according to three major operators: groupby, join, and extract.
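The three-way categorization can be sketched as a small classifier. The operator spellings accepted here and the precedence (join before groupby, extract as the fallback) are assumptions made for illustration; the slides only name the three categories.

```python
JOB_TYPES = ("join", "groupby", "extract")


def classify_job(operators):
    """Map a job's operator list to one of the three job types.

    `operators` is an iterable of operator names (hypothetical
    spellings such as "Join" or "Group By"). Precedence is assumed:
    a job containing a join counts as a join job, otherwise a
    group-by makes it a groupby job, and anything else is extract.
    """
    names = {
        op.strip().lower().replace("_", "").replace(" ", "")
        for op in operators
    }
    if "join" in names:
        return "join"
    if "groupby" in names:
        return "groupby"
    return "extract"
```

For example, `classify_job(["TableScan", "Filter", "Join"])` yields `"join"`, while a scan followed by a plain selection falls into `"extract"`.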

Multivariate job time modeling: The job time prediction model models and predicts job execution time based on selectivity estimation. It was trained on over 5647 MR jobs (about 1000 queries) from TPC-DS and TPC-H at different scales.
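One way to picture a selectivity-based time model is a small multivariate linear regression. The sketch below assumes a linear form, time = w0 + w1 * input + w2 * (input * selectivity), and fits it by ordinary least squares in pure Python. The feature choice, units, and function names are assumptions for illustration and are not the model reported in the slides.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def features(input_gb, selectivity):
    """Assumed features: intercept, input size, estimated intermediate size."""
    return [1.0, input_gb, input_gb * selectivity]


def fit(samples):
    """Least-squares fit. `samples` is a list of (input_gb, selectivity, seconds)."""
    rows = [features(size, sel) for size, sel, _ in samples]
    ys = [seconds for _, _, seconds in samples]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, input_gb, selectivity):
    """Predicted job time in seconds under the assumed linear model."""
    return sum(w * f for w, f in zip(weights, features(input_gb, selectivity)))
```

In practice one would likely fit a separate set of weights per job type, since join, groupby, and extract jobs respond differently to input size and selectivity.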

What should be done with the jobs of an active query when a new query arrives with a lower estimated resource consumption than the active query? 1. Kill running jobs to make room for the new query. 2. Wait for running jobs to finish.
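The trade-off between the two options can be sketched with a simple cost comparison. This is a hypothetical heuristic, not the policy from the slides: it assumes a single execution slot, that a killed job loses all progress and restarts from scratch after the new query, and that the goal is to minimize the sum of the two completion times.

```python
def total_completion_time_if_wait(elapsed, remaining, new_query_time):
    """Sum of completion times when the running job is allowed to finish.

    The running job completes after `remaining` seconds; the new query
    starts afterwards and completes at remaining + new_query_time.
    `elapsed` is unused here but kept for a uniform signature.
    """
    return remaining + (remaining + new_query_time)


def total_completion_time_if_kill(elapsed, remaining, new_query_time):
    """Sum of completion times when the running job is killed.

    The new query completes at new_query_time; the killed job then
    reruns from scratch, needing elapsed + remaining seconds.
    """
    return new_query_time + (new_query_time + elapsed + remaining)


def should_kill(elapsed, remaining, new_query_time):
    """True if killing gives a strictly lower total completion time.

    Algebraically this reduces to: new_query_time + elapsed < remaining.
    """
    kill_cost = total_completion_time_if_kill(elapsed, remaining, new_query_time)
    wait_cost = total_completion_time_if_wait(elapsed, remaining, new_query_time)
    return kill_cost < wait_cost
```

Under these assumptions, killing only pays off when the running job still has more work left than the new query's time plus the progress that would be thrown away.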

CROSS leads to better average query response performance. For Bing, it is 43.9% and 27.4% better than HFS and HCS, respectively. For Facebook, it is 40.2% and 72.8% better than HFS and HCS, respectively.

Conclusion and future work: The semantic gap in MapReduce data warehouse systems causes performance and fairness issues. In this framework, we enable selectivity estimation and time modeling. It achieves significant performance and fairness improvements, e.g., 43.9% better performance and 59.8% better fairness over HFS. In the future, we plan to deal with query progress indicators and new challenges.
