The Idea of Pig Or Pig Concepts

1 The Idea of Pig Or Pig Concepts
B. Ramamurthy 1/2/2019

2 References http://pig.apache.org/

3 What is Pig?: example
Pig is a scripting language that helps in designing big data solutions using high-level primitives.
A Pig script can be executed locally; it is typically translated into an MR job/task workflow and executed on Hadoop.
Pig itself is an MR job on Hadoop.
You can access the local file system using pig -x local (e.g. file:/…).
Other file systems accessible from the grunt> prompt of non-local Pig are hdfs:// and s3://.
You can transfer data into the local file system from s3:
    hadoop dfs -copyToLocal s3n://cse487/pig1/ps5.pig /home/hadoop/pig1/ps5.pig
    hadoop dfs -copyToLocal s3n://cse487/pig1/data2 /home/hadoop/pig1/data2
Then run ps5.pig in the local mode:
    pig -x local
    run ps5.pig

5 Simple Pig scripts: wordcount
    A = load 'data2' as (line);
    words = foreach A generate flatten(TOKENIZE(line)) as word;
    grpd = group words by word;
    cntd = foreach grpd generate group, COUNT(words);
    store cntd into 'pig1out';
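
To make the data flow concrete, here is a minimal Python sketch of what each relation in the wordcount script holds. It illustrates the semantics only, not how Pig executes the script. The sample lines are made up, and str.split() is a whitespace-only stand-in for TOKENIZE, which also splits on some punctuation.

```python
from collections import defaultdict

def wordcount(lines):
    """Mimic the load -> foreach/flatten -> group -> count pipeline."""
    # A = load 'data2' as (line): one record per input line
    A = list(lines)
    # words = foreach A generate flatten(TOKENIZE(line)) as word
    # str.split() is a stand-in for TOKENIZE (assumption: whitespace only)
    words = [word for line in A for word in line.split()]
    # grpd = group words by word: each key maps to a bag of its records
    grpd = defaultdict(list)
    for word in words:
        grpd[word].append(word)
    # cntd = foreach grpd generate group, COUNT(words)
    cntd = {group: len(bag) for group, bag in grpd.items()}
    return cntd

# Hypothetical input, not the course's data2 file
sample = ["the pig eats", "the pig sleeps"]
result = wordcount(sample)
print(result)
```

Each intermediate variable corresponds to one alias in the script, which is the point of the exercise: every Pig statement names the output of one step of the pipeline.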

6 Sample Pig script: simple data analysis
Data (data3):
     2  4  5
    -2  3  4
     3  5  6
    -4  5  7
    -7  4  6
     4  5
Script:
    A = LOAD 'data3' AS (x,y,z);
    B = FILTER A BY x > 0;
    C = GROUP B BY x;
    D = FOREACH C GENERATE group, COUNT(B);
    STORE D INTO 'p6out';
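
As a rough check of what this script produces, the following Python sketch applies the same FILTER, GROUP, and COUNT steps to the numbers on the slide. It assumes the numbers form rows of three values (x, y, z) and that the final row, which shows only two values, has a missing z; that row split is an assumption, since the slide's layout did not survive.

```python
from collections import Counter

# Rows of data3, assuming three values per row; the last row has no z value
# on the slide, so it is represented as None here (assumption).
data3 = [
    (2, 4, 5),
    (-2, 3, 4),
    (3, 5, 6),
    (-4, 5, 7),
    (-7, 4, 6),
    (4, 5, None),
]

# B = FILTER A BY x > 0
B = [row for row in data3 if row[0] > 0]

# C = GROUP B BY x; D = FOREACH C GENERATE group, COUNT(B)
D = sorted(Counter(x for x, _y, _z in B).items())
print(D)
```

Under this row split, three rows pass the filter and each positive x value occurs once, so every group has a count of 1.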

7 See the pattern?
LOAD
FILTER
GROUP
GENERATE (apply some function from piggybank)
STORE (DUMP for interactive debugging)

8 Pig Latin
Is the language Pig scripts are written in.
Is a parallel data flow language.
Mathematically, Pig Latin describes a directed acyclic graph (DAG) in which the edges are data flows and the nodes are operators that process the data.
It is a data flow language, not a control flow language: no if statements and for loops! (Traditional OO programming describes control flow, not data flow.)
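
One way to picture the DAG view is to write the wordcount script from slide 5 as a graph and ask for an order in which the operators can run. The Python sketch below does that with the standard-library graphlib module (Python 3.9+). The dictionary layout and the "store" node name are illustrative choices, not anything Pig exposes.

```python
from graphlib import TopologicalSorter

# Each operator (node) maps to the operators whose output it consumes (edges).
# Node names are the relation aliases from the wordcount script.
dataflow = {
    "A": set(),          # load
    "words": {"A"},      # foreach ... generate flatten(TOKENIZE(line))
    "grpd": {"words"},   # group
    "cntd": {"grpd"},    # foreach ... generate group, COUNT(words)
    "store": {"cntd"},   # store (hypothetical node name for the final step)
}

# static_order() raises CycleError if the graph is not acyclic
order = list(TopologicalSorter(dataflow).static_order())
print(order)
```

Because the script is a single chain, only one order is possible. A script with two independent loads feeding a join would admit several valid orders, which is where the "parallel" in "parallel data flow language" comes from.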

9 Pig and query language
How about Pig and SQL?
SQL describes the “what”, that is, the user’s question; it does NOT describe how the question is to be solved.
SQL is built around answering one question: lots of subqueries and temporary tables resulting in one thing, an inverted process.
Remember from our earlier discussions: if these temp tables are NOT in memory, their random access is expensive.
Pig describes the data pipeline from the first step to the final step.
HDFS vs RDBMS tables

10 SQL (vs. Pig)
    CREATE TEMP TABLE t1 AS
    SELECT customer, sum(purchase) AS total_purchases
    FROM transactions
    GROUP BY customer;

    SELECT t1.customer, total_purchases, zipcode
    FROM t1, customer_profile
    WHERE t1.customer = customer_profile.customer;

11 (SQL vs.) Pig
    txns = load 'transactions' as (customer, purchase);
    grouped = group txns by customer;
    total = foreach grouped generate group, SUM(txns.purchase) as tp;
    profile = load 'customer_profile' as (customer, zipcode);
    answer = join total by group, profile by customer;
    dump answer;
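
The step-by-step shape of the Pig version can be mirrored in a few lines of Python. This is only a sketch of the semantics: the rows below are hypothetical, since the slides do not show the contents of transactions or customer_profile, and the output tuple layout (group, tp, customer, zipcode) follows the fields of the two joined relations.

```python
from collections import defaultdict

# Hypothetical rows; the slides do not show the actual tables
transactions = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
customer_profile = [("alice", "14260"), ("bob", "14214")]

# grouped = group txns by customer;
# total = foreach grouped generate group, SUM(txns.purchase) as tp;
total = defaultdict(float)
for customer, purchase in transactions:
    total[customer] += purchase

# answer = join total by group, profile by customer;  (inner join)
answer = [
    (customer, total[customer], customer, zipcode)
    for customer, zipcode in customer_profile
    if customer in total
]
print(answer)
```

Each Python statement lines up with one Pig statement, whereas the SQL version expresses the same work as a temporary table plus a final query.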

12 Pig and HDFS and MR
Pig does not require HDFS.
Pig can run on any file system as long as you transfer the data flow and the data appropriately.
This is great since you can use not just file:// or hdfs:// but also other systems to be developed in the future.
Similarly, Pig Latin has several advantages over MR (see chapter 1 of the Programming Pig book).

13 Uses of Pig
Traditional Extract, Transform, Load (ETL) data pipelines
Research on raw data
Iterative processing
Prototyping (debugging) on small data and a local system before launching big data, multi-node MR jobs
Largest use case: data pipelines: raw data, cleanse, load into data warehouse
Ad-hoc queries on data where the schema is unknown
What is it not good for? For workloads that update a few records, or that look up data in some random order, Pig is not a good choice.
In 2009, 50% of the jobs executed at Yahoo! used Pig.
Let's execute some Pig scripts on the Amazon installation.

