The Idea of Pig Or Pig Concepts

The Idea of Pig Or Pig Concepts B. Ramamurthy 1/2/2019

References
http://pig.apache.org/
http://ofps.oreilly.com/titles/9781449302641/making_pig_fly.html

What is Pig?: example
Pig is a scripting language that helps in designing big data solutions using high-level primitives.
A Pig script can be executed locally; it is typically translated into an MR job/task workflow and executed on Hadoop.
Pig itself runs as an MR job on Hadoop.
You can access the local file system using pig -x local (e.g. file:/…).
Other file systems, hdfs:// and s3://, are accessible from the grunt> prompt of non-local Pig.
You can transfer data into the local file system from s3:
    hadoop dfs -copyToLocal s3n://cse487/pig1/ps5.pig /home/hadoop/pig1/ps5.pig
    hadoop dfs -copyToLocal s3n://cse487/pig1/data2 /home/hadoop/pig1/data2
Then run ps5.pig in local mode:
    pig -x local
    grunt> run ps5.pig

Simple pig scripts: wordcount
    A = load 'data2' as (line);
    words = foreach A generate flatten(TOKENIZE(line)) as word;
    grpd = group words by word;
    cntd = foreach grpd generate group, COUNT(words);
    store cntd into 'pig1out';
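
For readers new to Pig, the wordcount data flow can be traced step by step in plain Python. This is only a rough sketch, not how Pig executes the script: the sample lines are made up, the function name word_count is hypothetical, and str.split() stands in for TOKENIZE, which also splits on a few punctuation characters.

```python
from collections import Counter


def word_count(lines):
    """Mimic load -> TOKENIZE/flatten -> group -> COUNT for a list of text lines."""
    # flatten(TOKENIZE(line)): one record per word.
    # Assumption: whitespace splitting is close enough for illustration.
    words = [word for line in lines for word in line.split()]
    # group words by word, then COUNT(words): records per distinct word.
    return dict(Counter(words))


# Hypothetical stand-in for the contents of 'data2'.
sample = ["pig latin is a data flow language", "pig runs on hadoop"]
counts = word_count(sample)
print(counts)
```

Each Python statement corresponds to one or two Pig statements, which is the point of the comparison: the script names intermediate relations and passes them along, with no loops or branches written by the user.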

Sample Pig script: simple data analysis
Data (data3):
    2 4 5
    -2 3 4
    3 5 6
    -4 5 7
    -7 4 6
    4 5
Script:
    A = LOAD 'data3' AS (x,y,z);
    B = FILTER A BY x > 0;
    C = GROUP B BY x;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO 'p6out';
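
The filter-group-count steps of this script can be mirrored in a few lines of Python. This is a sketch under stated assumptions: the rows below are hypothetical (they are not the contents of data3), the function name is invented for illustration, and rows whose x is missing are simply dropped by the filter.

```python
from collections import Counter


def count_positive_x(rows):
    """Mimic FILTER BY x > 0, GROUP BY x, and COUNT for (x, y, z) tuples."""
    # B = FILTER A BY x > 0; a missing x never passes the filter.
    kept = [row for row in rows if row[0] is not None and row[0] > 0]
    # C = GROUP B BY x; D = FOREACH C GENERATE group, COUNT(B);
    return dict(Counter(row[0] for row in kept))


# Hypothetical (x, y, z) rows, chosen so that one x value repeats.
rows = [(2, 4, 5), (-2, 3, 4), (3, 5, 6), (2, 1, 1), (-7, 4, 6)]
result = count_positive_x(rows)
print(result)
```

The two rows with negative x are removed before grouping, so they never reach the count.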

See the pattern?
LOAD
FILTER
GROUP
GENERATE (apply some function from piggybank)
STORE (DUMP for interactive debugging)

Pig Latin
Is the language Pig scripts are written in.
Is a parallel data flow language.
Mathematically, Pig Latin describes a directed acyclic graph (DAG) where the edges are data flows and the nodes are operators that process data.
It is a data flow language, not a control flow language: no if statements and no for loops! (Traditional OO programming describes control flow, not data flow.)
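
To make the DAG view concrete, the sketch below writes a small script as a graph and asks Python for a valid execution order. It borrows the aliases of the customer-purchase script shown later in this deck (txns, grouped, total, profile, answer). The dictionary layout is an illustration only, not Pig's internal plan representation, and graphlib requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Each alias maps to the set of aliases it reads from (its incoming edges).
# The two load statements have no inputs inside the script.
dataflow = {
    "txns": set(),                    # load 'transactions'
    "profile": set(),                 # load 'customer_profile'
    "grouped": {"txns"},              # group txns by customer
    "total": {"grouped"},             # foreach grouped generate ...
    "answer": {"total", "profile"},   # join total ..., profile ...
}

# A topological order lists every operator after all of its inputs.
order = list(TopologicalSorter(dataflow).static_order())
print(order)
```

Several orders are valid here, since profile can be loaded at any point before the join. That freedom is what lets a data flow engine schedule independent branches in parallel.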

Pig and query language
How about Pig and SQL?
SQL describes the "what", that is, the user's question; it does NOT describe how the question is to be solved.
SQL is built around answering one question: lots of subqueries and temporary tables combine into one result, so the query is written as an inverted (inside-out) process.
Remember from our earlier discussions: if these temp tables are NOT in memory, random access to them is expensive.
Pig describes the data pipeline from the first step to the final step.
HDFS vs RDBMS tables

SQL (vs. Pig)
    CREATE TEMP TABLE t1 AS
    SELECT customer, sum(purchase) AS total_purchases
    FROM transactions
    GROUP BY customer;

    SELECT t1.customer, total_purchases, zipcode
    FROM t1, customer_profile
    WHERE t1.customer = customer_profile.customer;

(SQL vs.) Pig
    txns = load 'transactions' as (customer, purchase);
    grouped = group txns by customer;
    total = foreach grouped generate group, SUM(txns.purchase) as tp;
    profile = load 'customer_profile' as (customer, zipcode);
    answer = join total by group, profile by customer;
    dump answer;
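
The same pipeline can be followed in plain Python to see what each step produces. This is a hedged sketch: the records are invented, the function name is hypothetical, and it assumes at most one profile row per customer. It returns (customer, tp, zipcode), whereas Pig's join output would carry the key column from both inputs.

```python
from collections import defaultdict


def total_purchases_with_zip(transactions, profiles):
    """Mimic group + SUM, then an inner join with the customer profiles."""
    # grouped = group txns by customer; total = ... SUM(txns.purchase) as tp
    totals = defaultdict(float)
    for customer, purchase in transactions:
        totals[customer] += purchase
    # answer = join total by group, profile by customer (inner join).
    # Assumption: each customer appears at most once in profiles.
    zip_by_customer = dict(profiles)
    return sorted(
        (customer, tp, zip_by_customer[customer])
        for customer, tp in totals.items()
        if customer in zip_by_customer
    )


# Hypothetical records for 'transactions' and 'customer_profile'.
transactions = [("alice", 10.0), ("bob", 5.0), ("alice", 20.0), ("carol", 7.0)]
profiles = [("alice", "14260"), ("bob", "10001")]
answer = total_purchases_with_zip(transactions, profiles)
print(answer)
```

The customer with no profile row disappears at the join step, just as in the SQL version, but here every intermediate result has a name and can be inspected on its own.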

Pig and HDFS and MR
Pig does not require HDFS.
Pig can run on any file system as long as you transfer the data flow and the data appropriately.
This is great, since you can use not just file:// or hdfs:// but also other systems to be developed in the future.
Similarly, Pig Latin has several advantages over MR (see chapter 1 of the Programming Pig book).

Uses of Pig
Traditional Extract, Transform, Load (ETL) data pipelines
Research on raw data
Iterative processing
Prototyping (debugging) on small data and a local system before launching big-data, multi-node MR jobs
Largest use case: data pipelines that take raw data, cleanse it, and load it into a data warehouse
Ad-hoc queries on data where the schema is unknown
What is it not good for? For workloads that update a few records, or that look up data in some random order, Pig is not a good choice.
In 2009, 50% of the jobs executed at Yahoo! used Pig.
Let's execute some Pig scripts on an Amazon installation.