Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement.

Slides:

Advertisements

Similar presentations

Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, Utkarsh.

Advertisements

Hui Li Pig Tutorial Hui Li Some material adapted from slides by Adam Kawa the 3rd meeting of WHUG June 21, 2012.

Hadoop Pig By Ravikrishna Adepu.

CS525: Special Topics in DBs Large-Scale Data Management MapReduce High-Level Langauges Spring 2013 WPI, Mohamed Eltabakh 1.

Relational Algebra, Join and QBE Yong Choi School of Business CSUB, Bakersfield.

© Hortonworks Inc Daniel Dai Thejas Nair Page 1 Making Pig Fly Optimizing Data Processing on Hadoop.

EHarmony in Cloud Subtitle Brian Ko. eHarmony Online subscription-based matchmaking service Available in United States, Canada, Australia and United Kingdom.

D ATABASE S YSTEMS I R ELATIONAL A LGEBRA. 22 R ELATIONAL Q UERY L ANGUAGES Query languages (QL): Allow manipulation and retrieval of data from a database.

High Level Language: Pig Latin Hui Li Judy Qiu Some material adapted from slides by Adam Kawa the 3 rd meeting of WHUG June 21, 2012.

CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture VII: 2014/04/21.

Parallel Computing MapReduce Examples Parallel Efficiency Assignment

The Hadoop Stack, Part 1 Introduction to Pig Latin CSE – Cloud Computing – Fall 2014 Prof. Douglas Thain University of Notre Dame.

Presented By: Imranul Hoque

Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.

Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.

Midterm Review Lecture 14b. 14 Lectures So Far 1.Introduction 2.The Relational Model 3.Disks and Files 4.Relational Algebra 5.File Org, Indexes 6.Relational.

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research Shimin Chen Big.

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.

(Hadoop) Pig Dataflow Language B. Ramamurthy Based on Cloudera’s tutorials and Apache’s Pig Manual 6/27/2015.

Pig Latin Olston, Reed, Srivastava, Kumar, and Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD Shahram Ghandeharizadeh.

CS525: Big Data Analytics MapReduce Languages Fall 2013 Elke A. Rundensteiner 1.

HADOOP ADMIN: Session -2

Pig Acknowledgement: Modified slides from Duke University 04/13/10 Cloud Computing Lecture.

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.

Cloud Computing Other High-level parallel processing languages Keke Chen.

Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Yahoo! Research.

Big Data Analytics Training

MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.

Pig Latin CS 6800 Utah State University. Writing MapReduce Jobs Higher order functions Map applies a function to a list Example list [1, 2, 3, 4] Want.

Storage and Analysis of Tera-scale Data : 2 of Database Class 11/24/09

MapReduce High-Level Languages Spring 2014 WPI, Mohamed Eltabakh 1.

An Introduction to HDInsight June 27 th,

Instructor: Jinze Liu Fall Basic Components (2) Relational Database Web-Interface Done before mid-term Must-Have Components (2) Security: access.

RESTORE IMPLEMENTATION as an extension to pig Vijay S.

Presented by Priagung Khusumanegara Prof. Kyungbaek Kim

Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.

MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.

Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.

MAP-REDUCE ABSTRACTIONS 1. Abstractions On Top Of Hadoop We’ve decomposed some algorithms into a map-reduce “workflow” (series of map-reduce steps) –

6 1 Lecture 8: Introduction to Structured Query Language (SQL) J. S. Chou, P.E., Ph.D.

Database Systems Design, Implementation, and Management Coronel | Morris 11e ©2015 Cengage Learning. All Rights Reserved. May not be scanned, copied or.

Efficient RDF Storage and Retrieval in Jena2 Written by: Kevin Wilkinson, Craig Sayers, Harumi Kuno, Dave Reynolds Presented by: Umer Fareed 파리드.

Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.

Design of Pig B. Ramamurthy. Pig’s data model Scalar types: int, long, float (early versions, recently float has been dropped), double, chararray, bytearray.

SqlExam1Review.ppt EXAM - 1. SQL stands for -- Structured Query Language Putting a manual database on a computer ensures? Data is more current Data is.

CS347: Map-Reduce & Pig Hector Garcia-Molina Stanford University CS347Notes 09 1.

Apache PIG rev Tools for Data Analysis with Hadoop Hadoop HDFS MapReduce Pig Statistical Software Hive.

What is Pig ???. Why Pig ??? MapReduce is difficult to program. It only has two phases. Put the logic at the phase. Too many lines of code even for simple.

Data Cleansing with Pig Latin. Neubot Tests Data Structure.

MapReduce Compilers-Apache Pig

More SQL: Complex Queries, Triggers, Views, and Schema Modification

Mail call Us: / / Hadoop Training Sathya technologies is one of the best Software Training Institute.

Pig, Making Hadoop Easy Alan F. Gates Yahoo!.

INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER

Relational Algebra - Part 1

Pig : Building High-Level Dataflows over Map-Reduce

Pig Latin - A Not-So-Foreign Language for Data Processing

Relational Algebra Chapter 4, Part A

Pig Latin: A Not-So-Foreign Language for Data Processing

Introduction to PIG, HIVE, HBASE & ZOOKEEPER

Relational Algebra Chapter 4, Sections 4.1 – 4.2

Overview of big data tools

Pig : Building High-Level Dataflows over Map-Reduce

CSE 491/891 Lecture 21 (Pig).

Charles Tappert Seidenberg School of CSIS, Pace University

(Hadoop) Pig Dataflow Language

(Hadoop) Pig Dataflow Language

Pig and pig latin: An Introduction

Pig Hive HBase Zookeeper

Presentation transcript:

Pig Latin: A Not-So-Foreign Language for Data Processing Christopher Olsten, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins Acknowledgement : Modified the slides from University of Waterloo

Motivation  You ‟ re a procedural programmer  You have huge data  You want to analyze it 2

Motivation  As a procedural programmer…  May find writing queries in SQL unnatural and too restrictive  More comfortable with writing code; a series of statements as opposed to a long query. (Ex: MapReduce is so successful). 3

Motivation  Data analysis goals  Quick  Exploit parallel processing power of a distributed system  Easy  Be able to write a program or query without a huge learning curve  Have some common analysis tasks predefined  Flexible  Transform a data set(s) into a workable structure without much overhead  Perform customized processing  Transparent  Have a say in how the data processing is executed on the system 5

Motivation  Relational Distributed Databases  Parallel database products expensive  Rigid schemas  Processing requires declarative SQL query construction  Map-Reduce  Relies on custom code for even common operations  Need to do workarounds for tasks that have different data flows other than the expected Map  Combine  Reduce 6

Motivation  Relational Distributed Databases  Sweet Spot: Take the best of both SQL and Map-Reduce; combine high-level declarative querying with low-level procedural programming…Pig Latin!  Map-Reduce 7

Pig Latin Example Table urls: (url,category, pagerank) Find for each suffciently large category, the average pagerank of high-pagerank urls in that category SQL: SELECT category, AVG(pagerank) FROM urls WHERE pagerank > 0.2 GROUP BY category HAVING COUNT(*) > 10^6 Pig Latin: good_urls = FILTER urls BY pagerank > 0.2; groups = GROUP good_urls BY category; big_groups = FILTER groups BY COUNT(good_urls)>10^6; output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

Outline  System Overview  Pig Latin (The Language)  Data Structures  Commands  Pig (The Compiler)  Logical & Physical Plans  Optimization  Efficiency  Pig Pen (The Debugger)  Conclusion 8

Big Picture Pig Latin Script User- Defined Functions Pig Map-Reduce Statements Compile Optimize Write ResultsRead Data 10

Data Model  Atom - simple atomic value (ie: number or string)  Tuple  Bag  Map 11

Data Model  Atom  Tuple - sequence of fields; each field any type  Bag  Map 12

Data Model  Atom  Tuple  Bag - collection of tuples  Duplicates possible  Tuples in a bag can have different field lengths and field types  Map 13

Data Model  Atom  Tuple  Bag  Map - collection of key-value pairs  Key is an atom; value can be any type 14

Data Model  Control over dataflow Ex 1 (less efficient) spam_urls = FILTER urls BY isSpam(url); culprit_urls = FILTER spam_urls BY pagerank > 0.8; Ex 2 (most efficient) highpgr_urls = FILTER urls BY pagerank > 0.8; spam_urls = FILTER highpgr_urls BY isSpam(url);  Fully nested  More natural for procedural programmers (target user) than normalization  Data is often stored on disk in a nested fashion  Facilitates ease of writing user-defined functions  No schema required 15

Data Model  User-Defined Functions (UDFs)  Can be used in many Pig Latin statements  Useful for custom processing tasks  Can use non-atomic values for input and output  Currently must be written in Java 16  Ex: spam_urls = FILTER urls BY isSpam(url);

Speaking Pig Latin  LOAD  Input is assumed to be a bag (sequence of tuples)  Can specify a deserializer with “USING ‟  Can provide a schema with “AS ‟ newBag = LOAD ‘ filename ’ ; 17 Queries = LOAD ‘ query_log.txt ’ USING myLoad() AS ( userID,queryString, timeStamp)

Speaking Pig Latin  FOREACH  Apply some processing to each tuple in a bag  Each field can be:  A fieldname of the bag  A constant  A simple expression (ie: f1+f2)  A predefined function (ie: SUM, AVG, COUNT, FLATTEN)  A UDF (ie: sumTaxes(gst, pst) ) newBag = FOREACH bagName GENERATE field1, field2, …; 18

Speaking Pig Latin  FILTER  Select a subset of the tuples in a bag newBag = FILTER bagName BY expression ;  Expression uses simple comparison operators (==, !=,, …) and Logical connectors (AND, NOT, OR) some_apples = FILTER apples BY colour != ‘red’;  Can use UDFs some_apples = FILTER apples BY NOT isRed(colour); 19

Speaking Pig Latin  COGROUP  Group two datasets together by a common attribute  Groups data into nested bags grouped_data = COGROUP results BY queryString, revenue BY queryString; 20

Speaking Pig Latin  Why COGROUP and not JOIN? url_revenues = FOREACH grouped_data GENERATE FLATTEN(distributeRev(results, revenue)); 21

Speaking Pig Latin  Why COGROUP and not JOIN?  May want to process nested bags of tuples before taking the cross product.  Keeps to the goal of a single high-level data transformation per pig-latin statement.  However, JOIN keyword is still available: JOIN results BY queryString, revenue BY queryString; Equivalent temp = COGROUP results BY queryString, revenue BY queryString; join_result = FOREACH temp GENERATE FLATTEN(results), FLATTEN(revenue); 22

Speaking Pig Latin  STORE (& DUMP)  Output data to a file (or screen) STORE bagName INTO ‘ filename ’ ;  Other Commands (incomplete)  UNION - return the union of two or more bags  CROSS - take the cross product of two or more bags  ORDER - order tuples by a specified field(s)  DISTINCT - eliminate duplicate tuples in a bag  LIMIT - Limit results to a subset 23

Compilation  Pig system does two tasks:  Builds a Logical Plan from a Pig Latin script  Supports execution platform independence  No processing of data performed at this stage  Compiles the Logical Plan to a Physical Plan and Executes  Convert the Logical Plan into a series of Map-Reduce statements to be executed (in this case) by Hadoop Map-Reduce 24

Compilation  Building a Logical Plan  Verify input files and bags referred to are valid  Create a logical plan for each bag(variable) defined 25

Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); Load(user.dat) B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; 26

Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); Load(user.dat) B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ Group OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; 27

Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); Load(user.dat) B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ Group OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; Foreach 28

Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); Load(user.dat) B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ Group OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; Foreach Filter 29

Compilation  Building a Logical Plan Example A = LOAD ‘user.dat’ AS (name, age, city); Load(user.dat) B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ Filter OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; Group Foreach 30

Compilation  Building a Physical Plan A = LOAD ‘user.dat’ AS (name, age, city); Load(user.dat) B = GROUP A BY city; C = FOREACH B GENERATE group AS city, COUNT(A); D = FILTER C BY city IS ‘kitchener’ Filter OR city IS ‘waterloo’; STORE D INTO ‘local_user_count.dat’; Group Only happens when output is specified by STORE or DUMP Foreach 32

Compilation  Building a Physical Plan  Step 1: Create a map-reduce job for each COGROUP Map Reduce Load(user.dat) Filter Group Foreach 33

Compilation  Building a Physical Plan  Step 1: Create a map-reduce job for each COGROUP  Step 2: Push other commands into the map and reduce functions where Map possible  May be the case certain commands require their own map-reduce Reduce Load(user.dat) Filter Group job (ie: ORDER needs separate mapreduce jobs) Foreach 34

Compilation  Efficiency in Execution  Parallelism  Loading data - Files are loaded from HDFS  Statements are compiled into map-reduce jobs 35

Compilation  Efficiency with Nested Bags  In many cases, the nested bags created in each tuple of a COGROUP statement never need to physically materialize  Generally perform aggregation after a COGROUP and the statements for said aggregation are pushed into the reduce function  Applies to algebraic functions (ie: COUNT, MAX, MIN, SUM, AVG) 36

Compilation  Efficiency with Nested Bags Map Load(user.dat) Filter Group Foreach 37

Compilation  Efficiency with Nested Bags Load(user.dat) Filter Group Combine Foreach 38

Compilation  Efficiency with Nested Bags Reduce Load(user.dat) Filter Group Foreach 39

Compilation  Efficiency with Nested Bags  Why this works:  COUNT is an algebraic function; it can be structured as a tree of sub- functions with each leaf working on a subset of the data Reduce SUM CombineCOUNT 40

Compilation  Efficiency with Nested Bags  Pig provides an interface for writing algebraic UDFs so they can take advantage of this optimization as well.  Inefficiencies  Non-algebraic aggregate functions (ie: MEDIAN) need entire bag to materialize; may cause a very large bag to spill to disk if it doesn ‟ t fit in memory  Every map-reduce job requires data be written and replicated to the HDFS (although this is offset by parallelism achieved) 41

Debugging  How to verify the semantics of an analysis program  Run the program against whole data set. Might take hours!  Generate sample dataset  Empty result set may occur on few operations like join, filter  Generally, testing with sample dataset is difficult  Pig-Pen  Samples data from large dataset for Pig statements  Apply individual Pig-Latin commands against the dataset  In case of empty result, pig system resamples  Remove redundant samples 42

Debugging  Pig-Pen 42

Debugging  Pig-Latin command window and command generator 43

Debugging  Sand Box Dataset (generated automatically!) 44

Debugging  Pig-Pen  Provides sample data that is:  Real - taken from actual data  Concise - as small as possible  Complete - collectively illustrate the key semantics of each command  Helps with schema definition  Facilitates incremental program writing 45

Conclusion  Pig is a data processing environment in Hadoop that is specifically targeted towards procedural programmers who perform large-scale data analysis.  Pig-Latin offers high-level data manipulation in a procedural style.  Pig-Pen is a debugging environment for Pig-Latin commands that generates samples from real data. 47

More Info  Pig,  Hadoop, Anks- Thay! 48