Running TPC-H On Pig Jie Li, Koichi Ishida, Muzhi Zhao, Ralf Diestelkaemper, Xuan Wang, Yin Lin CPS 216: Data Intensive Computing Systems Dec 9, 2011.



Goals: Project 1: develop correct Pig scripts and compare with Hive’s TPC-H benchmark [1]. Project 2: analyze the results, identify Pig’s bottlenecks, and rewrite some Pig scripts [2]. [1] https://issues.apache.org/jira/browse/HIVE-600

Benchmark Set Up: TPC-H 2.8.0 with 100GB of data; Hadoop 0.20.203.0, Pig 0.9.0, Hive 0.7.1; EC2 small instances (1.7GB memory, 160GB storage); 8 slaves, each with 2 map slots and 1 reduce slot; 8 reducers per job.

Initial Result: Setting aside Q9, where Hive failed, Pig was faster than Hive only on Q16. These Pig scripts were written in Project 1.

Six Rules Of Writing Efficient Pig Scripts: (1) Reorder JOINs properly. (2) Use COGROUP for JOIN + GROUP. (3) Use FLATTEN for self-join. (4) Project before (CO)GROUP. (5) Remove types in LOAD. (6) Use hash-based aggregation.

Rule 1: Reorder JOINs properly. A join* costs a map, a shuffle, and a reduce phase, which means heavy I/O. Reorder joins to minimize intermediate results by running the joins with the smallest outputs first: joins with small tables, joins with filtered tables, and joins between a primary key and a foreign key. * We focused on the default hash join. The replicated join does not apply to most of the TPC-H joins, and its benefit is negligible in most queries.
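A rough illustration of why join order matters, written in Python rather than Pig so it can be run standalone. The toy tables, their sizes, and the `hash_join` helper are hypothetical and not taken from the TPC-H scripts; the sketch only counts the rows that would be materialized between two consecutive joins.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Inner hash join of two lists of dicts on a shared column name."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Hypothetical toy tables: one large table and two small ones,
# one of which has been filtered down to a single row.
orders = [{"oid": i, "cid": i % 10, "nid": i % 5} for i in range(1000)]
customers = [{"cid": c, "cname": f"c{c}"} for c in range(10)]
nations = [{"nid": 3, "nname": "filtered"}]  # filtered table: 1 row

# Order A: join the unfiltered table first, the filtered one last.
a1 = hash_join(orders, customers, "cid")
a2 = hash_join(a1, nations, "nid")

# Order B: join the filtered table first so the intermediate result shrinks early.
b1 = hash_join(orders, nations, "nid")
b2 = hash_join(b1, customers, "cid")

intermediate_a = len(a1)  # rows materialized between the two joins
intermediate_b = len(b1)
```

Both orders return the same final rows, but in this toy setup order B carries a fifth as many rows between the joins, which is the effect Rule 1 aims for.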

Apply Rule 1 to TPC-H: Both Q7 and Q9 contain 5+ joins. The Hive queries can also be rewritten in the same way.

Rule 2: COGROUP. Condition: a join followed by a group-by on the same key. Advantage: the join and the group-by can be done in a single COGROUP, which reduces the number of MapReduce jobs by one.

Rule 2 Example
SQL:
select A.x, COUNT(B.y) from A JOIN B on A.x = B.x GROUP by A.x
Pig:
t1 = COGROUP A by x, B by x;
t2 = FOREACH t1 GENERATE group, COUNT(B.y);
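To make the mechanics of Rule 2 concrete, here is a small Python model of a cogroup followed by a count. The `cogroup` helper and the sample rows are hypothetical, and the sketch assumes `A.x` is unique (as with a primary key), so that counting the `B` bag matches the SQL count. A plain COGROUP also yields keys present in only one input; the emptiness filter below is an assumption added to mirror the inner join in the SQL.

```python
from collections import defaultdict

def cogroup(a_rows, b_rows, key):
    """Group two inputs by the same key in one pass: key -> (bag_a, bag_b)."""
    groups = defaultdict(lambda: ([], []))
    for row in a_rows:
        groups[row[key]][0].append(row)
    for row in b_rows:
        groups[row[key]][1].append(row)
    return groups

# Hypothetical sample data; x is unique in A.
A = [{"x": 1}, {"x": 2}, {"x": 3}]
B = [{"x": 1, "y": 10}, {"x": 1, "y": 11}, {"x": 2, "y": 20}]

# Rough equivalent of: FOREACH t1 GENERATE group, COUNT(B.y)
# Keys with an empty bag on either side are dropped to match inner-join semantics.
counts = {
    k: len(bag_b)
    for k, (bag_a, bag_b) in cogroup(A, B, "x").items()
    if bag_a and bag_b
}
```

The grouping and the pairing of the two inputs happen in the same pass, which is why no separate join job is needed.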

Apply Rule 2 to TPC-H Query 13: COGROUP produces less output than the join and is therefore faster. Hive pushed the aggregation into the join.

Rule 3: FLATTEN. Condition: a group-by followed by a self-join on the same key. Advantage: the self-join can be performed inside the group-by using FLATTEN, which eliminates one MapReduce job.

Rule 3 Example
SQL:
select * from A as A1 where A1.y < ( select AVG(A2.y) from A as A2 where A2.x = A1.x )
Pig:
t1 = group A by x;
t2 = foreach t1 generate FLATTEN(A), AVG(A.y) as avg_y;
t3 = filter t2 by y < avg_y;
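The three Pig statements in the Rule 3 example can be traced step by step in plain Python. The sample rows are hypothetical; the point is only that each row is re-emitted next to its group's average, so the comparison needs no second pass over `A`.

```python
from collections import defaultdict

# Hypothetical sample rows for relation A.
A = [
    {"x": "a", "y": 1.0},
    {"x": "a", "y": 3.0},
    {"x": "b", "y": 5.0},
    {"x": "b", "y": 5.0},
    {"x": "b", "y": 2.0},
]

# t1 = group A by x
groups = defaultdict(list)
for row in A:
    groups[row["x"]].append(row)

# t2 = foreach t1 generate FLATTEN(A), AVG(A.y) as avg_y
t2 = []
for bag in groups.values():
    avg_y = sum(r["y"] for r in bag) / len(bag)
    t2.extend({**r, "avg_y": avg_y} for r in bag)

# t3 = filter t2 by y < avg_y
t3 = [r for r in t2 if r["y"] < r["avg_y"]]
```

After flattening, `t2` has as many rows as `A`, each carrying `avg_y`, which is exactly what the self-join in the SQL version would have produced.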

Apply Rules 2 and 3 to TPC-H Query 17: Q17 contains one regular join, one self-join, and one group-by, all on the same key. The pig (flatten) variant applies Rule 3 to perform the self-join in the group-by. The pig (cogroup+flatten) variant further applies Rule 2 to perform the regular join and the group-by together in a COGROUP.

Rule 4: Project before (CO)GROUP. Pig doesn’t prune nested columns in (CO)GROUP. This turned out to be the most effective rule; without it, Rules 2 and 3 won’t take effect. Open issue: https://issues.apache.org/jira/browse/PIG-1324

Rule 4 Example
A = load 'A.in' as (a,b,c,d,e,f,g,h,i,j,k,l,m,n);
A = foreach A generate a, b; -- project before GROUP
t1 = GROUP A by a;
t2 = foreach t1 generate group, SUM(A.b);
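As a back-of-the-envelope check on Rule 4, the following Python sketch counts how many field values end up inside the grouped bags with and without the early projection. The input rows and the `values_in_bags` measure are hypothetical stand-ins for shuffle volume, not measurements from Pig.

```python
from collections import defaultdict

COLUMNS = "abcdefghijklmn"  # 14 columns, as in the LOAD statement of the example

# Hypothetical input: 100 rows, every column holds a small integer.
rows = [{c: i % 7 for c in COLUMNS} for i in range(100)]

def group_by(rows, key):
    """Collect rows into bags keyed by the grouping column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups

def values_in_bags(groups):
    """Count field values held inside the grouped bags (a proxy for shuffle volume)."""
    return sum(len(row) for bag in groups.values() for row in bag)

unprojected = group_by(rows, "a")
projected = group_by([{"a": r["a"], "b": r["b"]} for r in rows], "a")

# Same aggregate either way; only the amount of data carried differs.
sum_b_full = {k: sum(r["b"] for r in bag) for k, bag in unprojected.items()}
sum_b_proj = {k: sum(r["b"] for r in bag) for k, bag in projected.items()}
```

In this toy case the projected input carries 200 values through the group instead of 1400, while `SUM(A.b)` is unchanged.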

Rule 5: Remove types in LOAD. With types, Pig casts the fields upon loading, which adds overhead. Without types, Pig does lazy conversion, but may use a more expensive type. Is it possible to keep the types and do lazy conversion? Open issue (since 2008): https://issues.apache.org/jira/browse/PIG-410
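The trade-off in Rule 5 can be sketched by counting casts in a toy loader. Everything here is hypothetical (the pipe-delimited lines, the `to_int` helper, the counter); it models only the "cast everything on load" versus "cast on first use" difference, not Pig's choice of a more expensive type for untyped fields.

```python
casts = {"eager": 0, "lazy": 0}

def to_int(raw, mode):
    """Convert a text field to int and record that a cast happened."""
    casts[mode] += 1
    return int(raw)

# Hypothetical raw input: 3 rows of 5 text fields each.
lines = ["1|2|3|4|5", "6|7|8|9|10", "11|12|13|14|15"]

# Eager: cast every field when the row is loaded, even fields never used.
eager_rows = [[to_int(f, "eager") for f in line.split("|")] for line in lines]
eager_sum = sum(row[0] for row in eager_rows)

# Lazy: keep fields as raw text and cast only the one the query touches.
lazy_rows = [line.split("|") for line in lines]
lazy_sum = sum(to_int(row[0], "lazy") for row in lazy_rows)
```

Both paths compute the same sum over the first field, but the eager path performs one cast per field per row, while the lazy path casts only the field that is read.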

Apply Rule 5 to TPC-H Query 6: Q6 reads one table, applies some filters, and returns a global aggregation. Pig is slower than Hive due to the aggregation. See the next rule.

Rule 6: Use hash-based aggregation. Sort-based aggregation is expensive due to sorting, spilling, shuffling, etc. Hash-based aggregation keeps a hash table inside the map phase. Hive supports this already; Pig is going to support it soon!
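A minimal Python sketch of the two aggregation strategies named in Rule 6. The records and both helper functions are hypothetical, and the sort-based path is deliberately simplified (it ignores any combiner); the sketch only contrasts how many records leave the map side.

```python
from collections import defaultdict
from itertools import groupby

# Hypothetical map input: (key, value) records.
records = [("A", 1), ("B", 2), ("A", 3), ("B", 4), ("A", 5), ("C", 6)]

def sort_based(records):
    """Emit every record, sort by key, then aggregate runs of equal keys."""
    emitted = sorted(records, key=lambda kv: kv[0])
    totals = {
        k: sum(v for _, v in run)
        for k, run in groupby(emitted, key=lambda kv: kv[0])
    }
    return totals, len(emitted)

def hash_based(records):
    """Aggregate in a hash table inside the mapper; emit one record per key."""
    table = defaultdict(int)
    for k, v in records:
        table[k] += v
    return dict(table), len(table)

sort_totals, sort_emitted = sort_based(records)
hash_totals, hash_emitted = hash_based(records)
```

The totals agree, but the hash-based path emits one record per distinct key rather than one per input record, so less data needs to be sorted and shuffled.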

Query 1 (Rule 6 will be applicable soon): Q1 has a group-by and several aggregations.

Six Rules Summary: Choose a better query plan for Pig, especially the order of joins. Make full use of Pig’s features, such as COGROUP and FLATTEN. Be aware of Pig’s current issues, such as projection, type conversion, and sort-based aggregation.

All rewritten queries based on Rule 1~5

Updated Result

Acknowledgement: We referred to six Pig scripts used in "Query optimization for massively parallel data processing" (SOCC '11). We appreciate Amazon EC2’s education grants. All scripts are available at https://issues.apache.org/jira/browse/PIG-2397