CSE 491/891 Lecture 21 (Pig).



What is Pig?
- Pig is a Hadoop extension that simplifies programming by providing a high-level data processing language on top of Hadoop
- Created at Yahoo! to make it easier for researchers and engineers to process massive datasets
- The main use of Pig is to help users transform data or compute summary statistics from the data

What is Pig?
There are two major components in Pig:
- A high-level data flow language called Pig Latin. A Pig Latin program specifies a sequence of steps for processing the input data
- A compiler that compiles and runs the Pig Latin script in an execution environment. There are currently two execution modes:
  - Local mode: Pig runs on a single JVM and accesses the local filesystem
  - Distributed mode: Pig translates queries into MapReduce jobs and runs them on a Hadoop cluster

What can Pig Latin Do?
- It provides commands to interact with HDFS
- It allows you to manipulate data stored in HDFS
- It allows you to select certain attributes
- It allows you to apply aggregate functions
- It allows you to join data from different “tables”
In other words, you can manipulate the data much as SQL does, except you’re working with HDFS (instead of a relational database). The operators are similar to SQL’s, but the language is different.

How to Run Pig (I) By entering commands directly into the Grunt interactive shell

How to Run Pig (II)
By using a script file (with extension *.pig)
- Step 1: Create a Pig script file
- Step 2: Execute the script by typing pig <script-file>

How to Run Pig (III) By embedding Pig queries in Java programs

Using Pig on AWS EMR
Pig is installed on the AWS EMR cluster and on hadoop2.cse.msu.edu.
Important: when you create an AWS EMR cluster, make sure you choose a software configuration that includes Pig as one of its applications (see the next slide)

EMR Software Configuration

Grunt Shell Commands
To invoke the shell, type:
- pig -x local (run in local mode)
- pig -x mapreduce (run in distributed mode; this is the default)
If you encounter a “file not found” error, you should provide the full path name (/usr/bin/pig) to run pig.
Note: after launching the cluster, it may take a while before Pig is loaded, so you may need to wait for some time before pig can be executed on the EMR cluster.

Example of Invoking Grunt Shell When working in local mode, you’ll be accessing the local filesystem

Example of Invoking Grunt Shell When working in mapreduce mode, you’ll be accessing the HDFS

Disabling Logging Info on the Pig Console
- Create a file called nolog.conf (which contains only one line)
- Include the nolog.conf file when invoking pig, replacing the path with the actual location of your nolog.conf file
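The slide’s screenshot is not reproduced in the transcript; as a sketch, the one-line nolog.conf is a log4j properties file that raises the logging threshold, passed to pig via its -4 (log4j configuration) option. The exact property and path shown in class may differ:

```
# nolog.conf: suppress Pig's console logging (log4j properties, one line)
log4j.rootLogger=fatal
```

Invocation (the path is illustrative): pig -x local -4 /home/user/nolog.conf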

Grunt Shell Commands
- To exit the grunt shell: grunt> quit
- To get help: grunt> help

Grunt Shell Commands
You can run HDFS commands by typing fs <HDFS-command> in the Grunt shell
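A sketch of what this looks like in the Grunt shell (the paths are illustrative, not from the slides):

```pig
-- Run HDFS commands from inside Grunt by prefixing them with fs
grunt> fs -ls /user/hadoop
grunt> fs -cat wiki_edit.txt
grunt> fs -rm -r output
```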

Grunt Shell Commands
You can also execute Pig scripts from within Grunt:
- exec <script-file>: the Pig script is executed in a separate workspace from the Grunt shell (so aliases in the script are not visible to the shell, and vice versa)
- run <script-file>: the Pig script is executed in the same workspace as Grunt; equivalent to typing each line of the script into the Grunt shell

Pig Latin
A high-level scripting language that allows users to manipulate large-scale data stored in HDFS. In this lecture, assume Pig is run in distributed mode.
Summary of Pig Latin syntax and commands:
- Read/write from/to HDFS
- Data types
- Diagnostic operators
- Expressions and functions
- Relational operators (UNION, JOIN, FILTER, etc.)
Note: there are no commands for INSERT, DELETE, or UPDATE

Typical Workflow of a Pig Latin Program
- Load data from HDFS into an alias: alias = LOAD filename AS (…)
- Manipulate the alias using relational operators, functions, etc. Each manipulation creates a new alias: new_alias = pig_command(old_alias)
- DUMP the alias to display it on the Grunt shell, or STORE the alias in an HDFS directory (if distributed mode)
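The workflow above can be sketched as a short Pig Latin script (the file name and field names here are illustrative, not from the slides):

```pig
-- Load tab-separated data from HDFS into an alias
data = LOAD 'input.txt' AS (name:chararray, score:int);

-- Each manipulation creates a new alias
grp  = GROUP data BY name;
avgs = FOREACH grp GENERATE group, AVG(data.score);

-- Display on the Grunt shell, or persist to an HDFS directory
DUMP avgs;
STORE avgs INTO 'output';
```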

Read-Write Operations

LOAD
Default: assumes the input data is tab-separated
mydata = LOAD 'input.txt' AS (attr1, attr2, …);
If the data is comma-separated, you can use the built-in PigStorage() function to parse the file:
mydata = LOAD 'input.txt' USING PigStorage(',') AS (attr1, attr2, …);
You can also define the attribute types:
mydata = LOAD 'input.txt' USING PigStorage(',') AS (attr1:chararray, attr2:int, …);

Example for Wiki Edits
Source file on HDFS: wiki_edit.txt
Suppose we want to count the number of edits for each article.
- How would we do this in SQL (assuming the data is stored in a table on MySQL)?
- How would we do this in Pig Latin (assuming the data is stored in HDFS)?

SQL Example for Wiki Edits
Source file on HDFS: wiki_edit.txt
Assumed schema for the MySQL table: Wiki_Edit(RevID, Article, TS, UName)
SQL for counting the number of edits per article (here "data" is an alias for the table, and LIMIT displays only the first 4 rows):

SELECT data.Article, COUNT(*)
FROM Wiki_Edit AS data
GROUP BY data.Article
LIMIT 4;

Pig Latin Example for Wiki Edits
Equivalent to the SQL query:
SELECT data.article, COUNT(*) FROM Wiki_Edit AS data GROUP BY data.article LIMIT 4;
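The slide’s script is shown as a screenshot that the transcript does not reproduce; a sketch of an equivalent Pig Latin program (the field names follow the schema mentioned on the DESCRIBE slide later; the file name is an assumption) might look like:

```pig
-- Load the edit log; fields follow the Wiki_Edit schema
data = LOAD 'wiki_edit.txt' AS (rev, article, ts, uname);

-- GROUP BY article, then COUNT the edits in each group
grp    = GROUP data BY article;
counts = FOREACH grp GENERATE group, COUNT(data);

-- Display only the first 4 rows
top4 = LIMIT counts 4;
DUMP top4;
```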

DUMP, STORE, and LIMIT
- DUMP prints out the content of an alias
- STORE saves the content of an alias to a file:
  STORE counts INTO 'output';
  STORE counts INTO 'output2' USING PigStorage(',');
- LIMIT allows you to specify the number of tuples (rows) to return

Atomic Data Types
Pig’s atomic types include int, long, float, double, chararray, and bytearray.

Complex Data Types
Pig’s complex types are tuple (an ordered set of fields), bag (a collection of tuples), and map (a set of key-value pairs).

Data Types
A field in a tuple or a value in a map can be null or of any atomic/complex type. The latter enables nesting and complex data structures:
(John, {(48, Jolly Rd, Okemos), (10, Grand, East Lansing)})
If you load data without specifying the full schema:
- If you leave out the field type, Pig will default to bytearray, which is the most generic type
- If you leave out its name, a field is unnamed and you can only reference it by its position ($0, $1, $2, and so on)

Example for Complex Data Types Note: The tuples in a row are tab-separated
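The slide’s example is a screenshot; as a sketch, a nested record like the one on the previous slide could be loaded with a bag-of-tuples schema (the file name and field names are illustrative):

```pig
-- Each line: a name, then a bag of address tuples (fields tab-separated)
people = LOAD 'people.txt'
         AS (name:chararray,
             addresses:bag{t:(num:int, street:chararray, city:chararray)});
```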

Example for Auto Data Auto.data (from lecture 12) Schema (http://archive.ics.uci.edu/ml/datasets/Automobile) 1:id, 2:make, 3:fuel_type, 4:std_or_turbo, 5:num_doors, 6:body_style, …, 25:price, 26:class

Example for Auto Data
- Load the data without specifying the schema
- Get the make of the vehicles (make is column #2)
- For each make, get the average price of the vehicles (price is column #25)
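The commands on the slide are a screenshot; a sketch using positional references (Pig positions are 0-indexed, so column #2 is $1 and column #25 is $24; the file name is assumed):

```pig
-- Load comma-separated data with no schema: fields are referenced by position
autos = LOAD 'auto.data' USING PigStorage(',');

-- Get the make of the vehicles (column #2 -> $1)
makes = FOREACH autos GENERATE $1;

-- For each make, compute the average price (column #25 -> $24)
grp       = GROUP autos BY $1;
avg_price = FOREACH grp GENERATE group, AVG(autos.$24);
DUMP avg_price;
```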

Diagnostic operators in Pig Latin

DESCRIBE The field “data” in grp is a bag with subfields rev, article, ts, uname

EXPLAIN
Shows the execution plan for processing an alias

ILLUSTRATE
Available on hadoop2.cse.msu.edu
Shows an example of the transformation, i.e., how to go from the original data -> grp -> counts
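Continuing with the wiki-edit aliases from earlier, the three diagnostic operators are invoked like this in the Grunt shell:

```pig
-- Print the schema of an alias: grp has a group key plus a bag named "data"
DESCRIBE grp;

-- Show the logical/physical/MapReduce execution plans for an alias
EXPLAIN counts;

-- Show a step-by-step example of the data flow on a small sample
ILLUSTRATE counts;
```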

Expressions (I)
Expressions are used with the FILTER, FOREACH, GROUP, and SPLIT operators, as well as in eval functions (to be discussed in the next lecture)

Expressions (II)

Pig’s Built-In Functions Important note: Pig function names are case-sensitive

Additional Notes on Pig
For more examples and notes on Pig Latin, please refer to the following documentation:
https://pig.apache.org/docs/r0.11.1/basic.html
http://chimera.labs.oreilly.com/books/1234000001811/ch05.html