
1 CSE 491/891 Lecture 21 (Pig)

2 What is Pig?
Pig is a Hadoop extension that simplifies programming by providing a high-level data processing language on top of Hadoop.
It was created at Yahoo! to make it easier for researchers and engineers to process massive datasets.
The main use of Pig is to help users transform data or compute summary statistics from the data.

3 What is Pig?
Two major components in Pig:
A high-level data flow language called Pig Latin. A Pig Latin program specifies a sequence of steps for processing the input data.
A compiler that compiles and runs the Pig Latin script in an execution environment. There are currently two execution modes:
Local mode: Pig runs on a single JVM and accesses the local filesystem.
Distributed (mapreduce) mode: Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster.

4 What can Pig Latin Do?
It provides commands to interact with HDFS.
It allows you to manipulate data stored in HDFS:
It allows you to select certain attributes.
It allows you to apply aggregate functions.
It allows you to join data from different "tables".
In other words, you can manipulate the data much as you would with SQL, except that you are working with HDFS (instead of a relational database): the operators are similar, but the language differs from SQL.

5 How to Run Pig (I)
By entering commands directly into the grunt interactive shell.

6 How to Run Pig (II)
By using a script file (with extension *.pig).
Step 1: Create a Pig script file.
Step 2: Execute the script by typing pig <script-file>.
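For illustration, a minimal sketch of such a script (the names myscript.pig and input.txt are hypothetical):

    -- myscript.pig: load a file and display its first 10 rows
    data    = LOAD 'input.txt' AS (line:chararray);
    first10 = LIMIT data 10;
    DUMP first10;

It would then be run from the command line as: pig myscript.pig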

7 How to Run Pig (III) By embedding Pig queries in Java programs

8 Using Pig on AWS EMR
Pig is installed on the AWS EMR cluster and on hadoop2.cse.msu.edu.
Important: when you create an AWS EMR cluster, make sure you choose the software configuration that includes Pig as one of its applications (see the next slide).

9 EMR Software Configuration

10 Grunt Shell Commands
To invoke the shell, type:
pig -x local (run in local mode)
pig -x mapreduce (run in distributed mode; this is the default)
If you encounter a "file not found" error, provide the full path name to run pig, e.g., /usr/bin/pig -x local or /usr/bin/pig -x mapreduce.
Note: after launching the cluster, it may take a while before Pig is loaded, so you may need to wait for some time before pig can be executed on the EMR cluster.

11 Example of Invoking Grunt Shell
When working in local mode, you’ll be accessing the local filesystem

12 Example of Invoking Grunt Shell
When working in mapreduce mode, you'll be accessing HDFS
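A sketch of what the two invocations look like (the output is illustrative):

    $ pig -x local
    grunt> ls

Here ls lists the local filesystem; after pig -x mapreduce, the same ls would list HDFS instead.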

13 Disabling Logging Info on the Pig Console
Create a file called nolog.conf (which has only 1 line).
Include the nolog.conf file when invoking pig, replacing the path with the actual location of your nolog.conf file; see the sketch below.
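One common way to set this up (a sketch; the log4j property and the path /home/hadoop/nolog.conf are assumptions, not taken from the slide):

    # nolog.conf: keep only fatal messages (assumed log4j setting)
    log4j.rootLogger=fatal

    $ pig -4 /home/hadoop/nolog.conf

Here -4 (also written -log4jconf) points pig at an alternate log4j configuration file.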

14 Grunt Shell Commands
To exit the grunt shell: grunt> quit
To get help: grunt> help

15 Grunt Shell Commands
You can run HDFS commands by typing fs <HDFS command> in the Grunt shell.
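For example (the file and directory names are hypothetical):

    grunt> fs -ls
    grunt> fs -mkdir input
    grunt> fs -cat input/wiki_edit.txt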

16 Grunt Shell Commands
You can also execute Pig scripts within Grunt:
exec <script-file>: the script runs in a separate workspace from the Grunt shell, so aliases in the script are not visible to the shell and vice versa.
run <script-file>: the script runs in the same workspace as Grunt; this is equivalent to typing each line of the script into the Grunt shell.
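For example (myscript.pig is a hypothetical file):

    grunt> exec myscript.pig
    grunt> run myscript.pig

After exec, aliases defined inside myscript.pig are not visible in the shell; after run, they are.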

17 Pig Latin
A high-level scripting language that allows users to manipulate large-scale data stored in HDFS.
In this lecture, assume Pig is running in distributed mode.
Summary of Pig Latin syntax and commands:
Read-write from/to HDFS
Data types
Diagnostic operators
Expressions and functions
Relational operators (UNION, JOIN, FILTER, etc.)
Note: there are no commands for INSERT, DELETE, or UPDATE.

18 Typical Workflow of a Pig Latin Program
Load data from HDFS into an alias: alias = LOAD filename AS (…)
Manipulate the alias using relational operators, functions, etc. Each manipulation creates a new alias: new_alias = pig_command(old_alias)
DUMP the alias to display it on the Grunt shell, or STORE the alias into an HDFS directory (if in distributed mode); see the sketch below.
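Putting the three steps together, a minimal sketch (the file, alias, and filter values are hypothetical):

    edits    = LOAD 'wiki_edit.txt' AS (rev, article, ts, uname);
    filtered = FILTER edits BY article == 'Hadoop';
    DUMP filtered;                 -- display on the Grunt shell
    STORE filtered INTO 'out';     -- or save to an HDFS directory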

19 Read-Write Operations

20 LOAD
Default: assumes the input data is tab-separated.
mydata = LOAD 'input.txt' AS (attr1, attr2, …);
If the data is comma-separated, you can use the built-in PigStorage() function to parse the file:
mydata = LOAD 'input.txt' USING PigStorage(',') AS (attr1, attr2, …);
You can also define the attribute types:
mydata = LOAD 'input.txt' USING PigStorage(',') AS (attr1:chararray, attr2:int, …);

21 Example for Wiki Edits
Source file on HDFS: wiki_edit.txt
Suppose we want to count the number of edits for each article.
How would we do this in SQL (assuming the data is stored in a table on MySQL)?
How would we do this in Pig Latin (assuming the data is stored in HDFS)?

22 SQL Example for Wiki Edits
Source file on HDFS: wiki_edit.txt
Assume the schema for the MySQL table is Wiki_Edit(RevID, Article, TS, UName).
SQL for counting the number of edits per article:
SELECT data.Article, COUNT(*) FROM Wiki_Edit AS data GROUP BY data.Article LIMIT 4;
Here "data" is an alias for the table, and LIMIT 4 displays only the first 4 rows.

23 Pig Latin Example for Wiki Edits
Equivalent to the SQL query: SELECT data.Article, COUNT(*) FROM Wiki_Edit AS data GROUP BY data.Article LIMIT 4;
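The Pig Latin itself appeared as a screenshot on the slide; a sketch consistent with the aliases used on later slides (data, grp, counts):

    data   = LOAD 'wiki_edit.txt' AS (rev, article, ts, uname);
    grp    = GROUP data BY article;
    counts = FOREACH grp GENERATE group, COUNT(data);
    top4   = LIMIT counts 4;
    DUMP top4;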

24 DUMP, STORE, and LIMIT
DUMP prints out the content of an alias.
STORE saves the content of an alias to a file:
STORE counts INTO 'output';
STORE counts INTO 'output2' USING PigStorage(',');
LIMIT allows you to specify the number of tuples (rows) to return.

25 Atomic Data Types
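The table of types appeared as an image. As a hedged reminder, Pig's atomic types include int, long, float, double, chararray (strings), and bytearray (raw bytes); for example:

    edits = LOAD 'wiki_edit.txt'
            AS (rev:long, article:chararray, ts:chararray, uname:chararray);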

26 Complex Data Types

27 Data Types
A field in a tuple or a value in a map can be null or of any atomic/complex type. The latter enables nesting and complex data structures:
(John, {(48, Jolly Rd, Okemos), (10, Grand, East Lansing)})
If you load data without specifying the full schema:
If you leave out the field type, Pig defaults to bytearray, the most generic type.
If you leave out its name, the field is unnamed and you can only reference it by its position ($0, $1, $2, and so on).
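A sketch of declaring such a nested schema at load time (the file and field names are hypothetical, and the input must use Pig's bag/tuple literal format):

    people = LOAD 'addresses.txt'
             AS (name:chararray,
                 addrs:bag{t:(num:int, street:chararray, city:chararray)});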

28 Example for Complex Data Types
Note: The tuples in a row are tab-separated

29 Example for Auto Data
Auto.data (from lecture 12)
Schema: (1:id, 2:make, 3:fuel_type, 4:std_or_turbo, 5:num_doors, 6:body_style, …, 25:price, 26:class)

30 Example for Auto Data
Load the data without specifying the schema.
Get the make of vehicles (make is column #2, so it is referenced positionally as $1).
For each make, get the average price of vehicles (price is column #25, i.e., $24); see the sketch below.
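A minimal sketch (treating auto.data as comma-separated is an assumption):

    autos  = LOAD 'auto.data' USING PigStorage(',');
    priced = FOREACH autos GENERATE $1 AS make, (double)$24 AS price;
    byMake = GROUP priced BY make;
    avgp   = FOREACH byMake GENERATE group, AVG(priced.price);
    DUMP avgp;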

31 Diagnostic operators in Pig Latin

32 DESCRIBE
The field "data" in grp is a bag with subfields rev, article, ts, uname.
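Usage sketch; the output shape would be along these lines (exact formatting may differ, and the fields default to bytearray because no types were declared):

    grunt> DESCRIBE grp;
    grp: {group: bytearray, data: {(rev: bytearray, article: bytearray, ts: bytearray, uname: bytearray)}}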

33 EXPLAIN
Shows the execution plan for processing an alias.

34 ILLUSTRATE
Available on hadoop2.cse.msu.edu.
Shows an example of the transformation, i.e., how to go from the original data -> grp -> counts.
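Both diagnostics are invoked on an alias, e.g.:

    grunt> EXPLAIN counts;
    grunt> ILLUSTRATE counts;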

35 Expressions (I)
Expressions are used with the FILTER, FOREACH, GROUP, and SPLIT operators, as well as with eval functions (to be discussed in the next lecture).
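A small sketch using the wiki-edit aliases from earlier slides (the literal values 'Hadoop' and 'alice' are hypothetical):

    hadoop_edits = FILTER data BY article == 'Hadoop';
    SPLIT data INTO alice_edits IF uname == 'alice',
                    other_edits IF uname != 'alice';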

36 Expressions (II)

37 Pig’s Built-In Functions
Important note: Pig function names are case-sensitive
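For example, the built-in COUNT must be written in upper case; the lowercase variant would typically fail to resolve:

    counts = FOREACH grp GENERATE group, COUNT(data);   -- works
    bad    = FOREACH grp GENERATE group, count(data);   -- fails: name is case-sensitive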

38 Additional Notes on Pig
For more examples and notes on Pig Latin, please refer to the following documentation

