Prediction-Based Multivariate Query Modeling Analytic Queries.

MapReduce is an important data-centric programming model. To make it easier to program, a set of data warehouse systems and query languages has been developed on top of MapReduce. Hive and Pig are popular examples. In 2009, more than 40% of Hadoop production jobs at Yahoo! were Pig programs. At Facebook, 95% of MapReduce jobs are generated by Hive.

Hive and Hadoop: Users submit queries in Hive SQL, a subset of SQL used in the unstructured world of Hadoop. In Hive, each SQL query is compiled and translated into a DAG (directed acyclic graph) of MapReduce jobs with dependencies among them.
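To make the DAG idea concrete, here is a minimal Python sketch, not Hive's actual planner. The query `q1_dag`, its job names, and the helper `submission_waves` are hypothetical; the sketch only shows how dependencies restrict which jobs may run at a given moment.

```python
from graphlib import TopologicalSorter

# Hypothetical query Q1: each job maps to the set of jobs it depends on.
q1_dag = {
    "J1": set(),          # scan + filter on one table
    "J2": set(),          # scan + filter on another table
    "J3": {"J1", "J2"},   # join of the J1 and J2 outputs
    "J4": {"J3"},         # group-by on the join result
}

def submission_waves(dag):
    """Group jobs into waves; a job enters a wave once all its dependencies are done."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs runnable right now
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished, unlocking successors
    return waves

waves = submission_waves(q1_dag)
print(waves)
```

Jobs in the same wave are independent of each other and could run concurrently; jobs in later waves cannot start until the earlier ones complete.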

MapReduce adopts a job-level scheduling policy that strives for a balanced distribution of tasks and effective utilization of resources. However, such a simplistic policy cannot reconcile the dynamics of the different jobs in complex analytic queries.

Loss of query semantics during job submission: the Hadoop side sees only individual jobs.
[Figure: Hive components (Parser, Semantic Analyzer, Planner, Optimizer, MapredWork Generator, Execution Engine) turn HiveQL queries into MR jobs; Hadoop MapReduce components (Job Listener, JobTracker, Task Scheduler, TaskTrackers) run them and return results. The job queue holds J3(Q1), J2(Q1), J4(Q2), … Legend: runnable job, un-submitted job, completed job.]

Semantic gap between Hive and Hadoop: Hadoop is unaware of these dependencies and inter-job relationships and treats all jobs as the same. Consequences: suboptimal query response efficiency and unfairness among queries.

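To implement this, we add modules: semantics extraction (DAG, operator type, predicates, etc.), multivariate prediction (selectivity estimation), and two-level scheduling.
[Figure: on the Hive side, HiveQL queries pass through the Parser, Semantics Analyzer, and Execution Engine; jobs and their semantics are handed to Hadoop (JobListener, JobTracker, TaskTrackers), where the Semantics Extraction, Multivariate Prediction, and Two-Level Scheduling modules are added; results are returned to Hive.]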

Multivariate query modeling: Dynamically allocate resources among workflows and prioritize latency-sensitive small queries. The MapReduce jobs in Hive queries are categorized into three types according to their major operator: groupby, join, or extract.
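The categorization and prioritization above can be sketched as follows. This is an illustration under stated assumptions, not the presented system's code: the rule that join takes precedence over groupby, the `Query` class, and the `estimated_cost` field are all hypothetical, and "smallest estimated cost first" is just one simple way to favor small queries.

```python
from dataclasses import dataclass

# Assumed precedence: a job containing a join counts as "join", else "groupby".
MAJOR_OPERATORS = ("join", "groupby")

def job_type(operators):
    """Return 'join', 'groupby', or 'extract' for a job's operator list."""
    for op in MAJOR_OPERATORS:
        if op in operators:
            return op
    return "extract"  # neither join nor groupby present

@dataclass
class Query:
    name: str
    estimated_cost: float  # hypothetical unit, e.g. predicted slot-seconds

def prioritize(queries):
    """Order queries so that the one with the smallest estimated cost comes first."""
    return sorted(queries, key=lambda q: q.estimated_cost)

types = [job_type(ops) for ops in (
    ["tablescan", "filter"],
    ["tablescan", "groupby"],
    ["join", "groupby"],
)]
order = [q.name for q in prioritize(
    [Query("Q1", 900.0), Query("Q2", 40.0), Query("Q3", 310.0)]
)]
print(types, order)
```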

Multivariate job time modeling: The job time prediction model models and predicts job execution time based on selectivity estimation. It is trained on over 5647 MR jobs from about 1000 TPC-DS and TPC-H queries at different scales.
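As a rough illustration of predicting job time from selectivity, the sketch below fits a multivariate linear model by least squares. The linear form, the two features (input size and selectivity-derived output size), and the training numbers are assumptions made for this example; the slides do not give the actual model or data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit(features, times):
    """Least-squares fit of time = w0 + w1*f1 + w2*f2 via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, input_gb, selectivity):
    """Predict job time from input size and an estimated selectivity."""
    output_gb = input_gb * selectivity  # selectivity estimation step
    return weights[0] + weights[1] * input_gb + weights[2] * output_gb

# Synthetic training set (input GB, output GB) -> seconds; not TPC-DS/TPC-H data.
features = [(10, 1), (20, 10), (40, 4), (80, 40), (5, 5)]
times = [33.0, 125.0, 117.0, 485.0, 55.0]

weights = fit(features, times)
estimate = predict(weights, input_gb=50, selectivity=0.2)
print(weights, estimate)
```

The point of the sketch is the pipeline: estimate selectivity, derive data sizes from it, and feed those sizes into a model trained on past jobs.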

What should be done with the jobs of an active query when a new query arrives with a lower estimated resource consumption than the active query? 1. Kill running jobs to make room for the new query. 2. Wait for running jobs to finish.
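The trade-off between the two options can be seen in a toy model. The sketch assumes a single pool of slots that is fully occupied by the running jobs; the function name and the time values are hypothetical, and the slides do not say which option the authors adopt.

```python
def compare_policies(elapsed, remaining, new_query_time):
    """Compare 'kill' and 'wait' for a newly arrived, smaller query.

    elapsed:        time the running jobs have already consumed
    remaining:      time the running jobs still need
    new_query_time: estimated run time of the new query
    """
    return {
        # Killing frees the slots at once but throws away the work done so far.
        "kill": {"new_query_response": new_query_time, "wasted_work": elapsed},
        # Waiting wastes nothing but delays the new query by the remaining time.
        "wait": {"new_query_response": remaining + new_query_time, "wasted_work": 0.0},
    }

outcome = compare_policies(elapsed=30.0, remaining=10.0, new_query_time=20.0)
print(outcome)
```

In this model, killing gives the new query the shorter response time at the price of redoing the elapsed work, while waiting is cheaper overall when the running jobs are close to finishing.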

CROSS leads to better average query response performance. For Bing, it is 43.9% and 27.4% better than HFS and HCS, respectively. For Facebook, it is 40.2% and 72.8% better than HFS and HCS, respectively.

Conclusion and future work: The semantic gap in MapReduce data warehouse systems causes performance and fairness issues. This framework enables selectivity estimation and time modeling, and it achieves significant improvements in performance and fairness, e.g., 43.9% better performance and 59.8% better fairness than HFS. Future work will address query progress indicators and new challenges.