Unit 5: Working with Pig


Purpose
- Get familiar with the Pig environment
- Advanced features
- Walk through some examples

Pig environment
- Installed in nimbus17:/usr/local/pig
- Current version: 0.9.2
- Web site: pig.apache.org
- Set up your path (already done; check your .profile)
- Copy the sample code/data from /home/hadoop/pig/examples

Two modes to run Pig
- Interactive mode
  - Local: "pig -x local"
  - Hadoop: "pig -x mapreduce", or just "pig"
- Batch mode: all commands in one script file
  - Local: "pig -x local your_script"
  - Hadoop: "pig your_script"

Comments
- /* ... */ for multiple lines
- -- for a single line

First simple program: id.pig

    A = load '/etc/passwd' using PigStorage(':');  -- load the passwd file
    B = foreach A generate $0 as id;               -- extract the user IDs
    store B into 'id.out';                         -- write the results to id.out

Test-run it in both interactive mode and batch mode.
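To see what id.pig computes, here is a rough Python sketch of the same dataflow: split each colon-delimited line and keep field $0. The function name and sample lines are illustrative, not from the lecture.

```python
# Rough Python equivalent of id.pig: split each line on ':' (the
# PigStorage(':') analogue) and keep the first field ($0 as id).
def extract_ids(lines):
    ids = []
    for line in lines:
        fields = line.rstrip("\n").split(":")
        ids.append(fields[0])
    return ids

sample = ["root:x:0:0:root:/root:/bin/bash",
          "hadoop:x:1001:1001::/home/hadoop:/bin/bash"]
print(extract_ids(sample))  # ['root', 'hadoop']
```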

2nd program: student.pig

    A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
    B = FOREACH A GENERATE name;
    DUMP B;

Use DUMP for debugging and STORE for final output.
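A Python sketch of what student.pig does: load tab-separated records, apply the declared schema types (chararray, int, float), and project the name field. The sample rows are illustrative.

```python
# Sketch of student.pig: parse tab-separated rows (PigStorage() default
# delimiter) into typed tuples, then project the name field.
def load_students(lines):
    rows = []
    for line in lines:
        name, age, gpa = line.rstrip("\n").split("\t")
        rows.append((name, int(age), float(gpa)))  # apply the declared types
    return rows

data = ["joe\t18\t2.5", "sam\t19\t3.9"]
names = [row[0] for row in load_students(data)]  # FOREACH A GENERATE name
print(names)  # ['joe', 'sam']
```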

Built-in functions
- Eval functions
- Load/Store functions
- Math functions
- String functions
- Type conversion functions

UDF languages
- Java
- Python or Jython
- JavaScript
- Ruby
- Piggybank – a library of user-contributed UDFs

UDF example. Compile it:

    cd myudfs
    javac -cp pig.jar UPPER.java
    cd ..
    jar -cf myudfs.jar myudfs

UDF: aggregate function – e.g., a COUNT-style eval function that returns a Long

UDF: FilterFunc

    B = FILTER A BY isEmpty(A.bagfield);
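The FILTER ... BY isEmpty(...) idiom keeps only the records whose bag-valued field is empty. A small Python sketch, with illustrative (id, bag) records:

```python
# Sketch of FILTER A BY isEmpty(A.bagfield): keep records whose
# bag field is empty. Records are illustrative (id, bag) pairs.
A = [(1, ["a", "b"]), (2, []), (3, ["c"])]
B = [rec for rec in A if len(rec[1]) == 0]  # isEmpty analogue
print(B)  # [(2, [])]
```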

See http://pig.apache.org/docs/r0.13.0/udf.html#udf-java for more Java examples.

Python UDFs
- Only work with Hadoop versions earlier than 2.0
- test.py
- How to use it

3rd program: script1-local.pig

Query phrase popularity: processes a search query log file from the Excite search engine and finds search phrases (n-grams) that occur with particularly high frequency during certain times of the day. Shows how to use UDFs.

Each log record contains a user cookie, a timestamp in YYMMDDHHMMSS format, and a query.

1. Register the tutorial JAR file so that the included UDFs can be called in the script.

       REGISTER ./tutorial.jar;

2. Use the PigStorage function to load the excite log file (excite.log or excite-small.log) into the "raw" bag as an array of records with the fields user, time, and query.

       raw = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);

3. Call the NonURLDetector UDF to remove records if the query field is empty or a URL.

       clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);

4. Call the ToLower UDF to change the query field to lowercase.

       clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) AS query;

5. Because the log file only contains queries for a single day, we are only interested in the hour. The excite query log timestamp format is YYMMDDHHMMSS. Call the ExtractHour UDF to extract the hour (HH) from the time field.

       houred = FOREACH clean2 GENERATE user, org.apache.pig.tutorial.ExtractHour(time) AS hour, query;

6. Call the NGramGenerator UDF to compose the n-grams of the query.

       ngramed1 = FOREACH houred GENERATE user, hour, FLATTEN(org.apache.pig.tutorial.NGramGenerator(query)) AS ngram;

7. Use the DISTINCT operator to get the unique n-grams for all records.

       ngramed2 = DISTINCT ngramed1;

8. Use the GROUP operator to group records by n-gram and hour.

       hour_frequency1 = GROUP ngramed2 BY (ngram, hour);

9. Use the COUNT function to get the count (occurrences) of each n-gram.

       hour_frequency2 = FOREACH hour_frequency1 GENERATE FLATTEN($0), COUNT($1) AS count;

10. Use the GROUP operator to group records by n-gram only. Each group now corresponds to a distinct n-gram and has the count for each hour.

       uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;

11. For each group, identify the hour in which this n-gram is used with particularly high frequency. Call the ScoreGenerator UDF to calculate a "popularity" score for the n-gram.

       uniq_frequency2 = FOREACH uniq_frequency1 GENERATE FLATTEN($0), FLATTEN(org.apache.pig.tutorial.ScoreGenerator($1));

12. Use the FOREACH-GENERATE operator to assign names to the fields.

       uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 AS hour, $0 AS ngram, $2 AS score, $3 AS count, $4 AS mean;
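Steps 8–9 (group by (ngram, hour), then count) can be sketched in Python with a Counter keyed on the (ngram, hour) pair. The sample records are illustrative (user, hour, ngram) tuples, not real Excite data.

```python
# Sketch of GROUP ngramed2 BY (ngram, hour) followed by COUNT:
# tally how many records share each (ngram, hour) key.
from collections import Counter

records = [("u1", "07", "pig"), ("u2", "07", "pig"),
           ("u3", "08", "pig"), ("u1", "07", "hadoop")]

hour_frequency = Counter((ngram, hour) for _, hour, ngram in records)
print(hour_frequency[("pig", "07")])  # 2
```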

13. Use the FILTER operator to remove all records with a score less than or equal to 2.0.

       filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;

14. Use the ORDER operator to sort the remaining records by hour and score.

       ordered_uniq_frequency = ORDER filtered_uniq_frequency BY hour, score;

15. Use the PigStorage function to store the results. The output file contains a list of n-grams with the following fields: hour, ngram, score, count, mean.

       STORE ordered_uniq_frequency INTO '/tmp/tutorial-results' USING PigStorage();

4th program: script2-local.pig

Temporal query phrase popularity: processes a search query log file from the Excite search engine and compares the frequency of search phrases across two time periods separated by twelve hours. Shows how to use JOIN.
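The JOIN at the heart of script2-local.pig pairs up per-ngram counts from the two time periods on the ngram field, like JOIN hour00 BY ngram, hour12 BY ngram. A minimal Python sketch; the counts shown are illustrative:

```python
# Sketch of an inner join on ngram: keep only n-grams present in
# both time periods, pairing their counts.
hour00 = {"pig": 5, "hadoop": 3}  # illustrative counts from hour 00
hour12 = {"pig": 2, "hive": 7}    # illustrative counts from hour 12

joined = [(ng, hour00[ng], hour12[ng]) for ng in hour00 if ng in hour12]
print(joined)  # [('pig', 5, 2)]
```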