LOAD, DUMP, DESCRIBE operators


Employ data:
001 Robin 22 newyork
002 BO 23 Kolkata
003 Maya 23 Tokyo
004 Sara 25 London
005 David 23 Bhuwaneshwar
006 Maggy 22 Chennai

grunt> A = LOAD 'piginput/employ';
grunt> DUMP A;
(001,Robin,22,newyork)
(002,BO,23,Kolkata)
(003,Maya,23,Tokyo)
(004,Sara,25,London)
(005,David,23,Bhuwaneshwar)
(006,Maggy,22,Chennai)
grunt> DESCRIBE A;
Schema for A unknown.

Defining a schema and using a load function

Employ data:
001:Robin:22:newyork
002:BO:23:Kolkata
003:Maya:23:Tokyo
004:Sara:25:London
005:David:23:Bhuwaneshwar
006:Maggy:22:Chennai

grunt> A = LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> DUMP A;
(1,Robin,22,newyork)
(2,BO,23,Kolkata)
(3,Maya,23,Tokyo)
(4,Sara,25,London)
(5,David,23,Bhuwaneshwar)
(6,Maggy,22,Chennai)
grunt> DESCRIBE A;
A: {id: int,name: chararray,age: int,city: bytearray}

If we omit a field's type, the default type is bytearray:
grunt> A = LOAD 'piginput/employ' USING PigStorage(':') AS (id,name:chararray,age:int,city);
grunt> DESCRIBE A;
A: {id: bytearray,name: chararray,age: int,city: bytearray}

GROUP and COGROUP operators

Grouping Employ by age:
grunt> A = LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> B = GROUP A BY age;
grunt> DUMP B;
(22,{(1,Robin,22,newyork),(6,Maggy,22,Chennai)})
(23,{(2,BO,23,Kolkata),(3,Maya,23,Tokyo),(5,David,23,Bhuwaneshwar)})
(25,{(4,Sara,25,London)})
grunt> DESCRIBE B;
B: {group: int,A: {(id: int,name: chararray,age: int,city: bytearray)}}

Grouping Employ by ALL:
grunt> B = GROUP A ALL;
grunt> DUMP B;
(all,{(1,Robin,22,newyork),(2,BO,23,Kolkata),(3,Maya,23,Tokyo),(4,Sara,25,London),(5,David,23,Bhuwaneshwar),(6,Maggy,22,Chennai)})

Grouping by multiple fields

Consider the student data:
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,21,9848022339,Hyderabad
004,Preethi,Agarwal,21,9848022330,Punei
005,Trupthi,Mohanthy,23,9848022336,Chennai
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,trivendram

grunt> A = LOAD 'piginput/student' USING PigStorage(',') AS (id:int,firstname:chararray,lastname:chararray,age:int,phno:long,city:chararray);
grunt> g1 = GROUP A BY (age,city);
grunt> DUMP g1;
((21,Punei),{(4,Preethi,Agarwal,21,9848022330,Punei)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad),(3,Rajesh,Khanna,21,9848022339,Hyderabad)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(5,Trupthi,Mohanthy,23,9848022336,Chennai),(6,Archana,Mishra,23,9848022335,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram),(8,Bharathi,Nambiayar,24,9848022333,trivendram)})

Grouping by an expression

grunt> A = LOAD 'data' AS (f1:chararray, f2:int, f3:int);
grunt> DUMP A;
(r1,1,2)
(r2,2,1)
(r3,2,8)
(r4,4,4)

In this example the tuples are grouped using an expression, f2*f3:
grunt> X = GROUP A BY f2*f3;
grunt> DUMP X;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})

The COGROUP operator works in much the same way as the GROUP operator. The only difference is that GROUP is normally used with one relation, while COGROUP is used in statements involving two or more relations.

grunt> employ = LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> student = LOAD 'piginput/student' USING PigStorage(',') AS (id:int,firstname:chararray,lastname:chararray,age:int,phno:long,city:chararray);
grunt> cogroup_data = COGROUP student BY age, employ BY age;
grunt> DUMP cogroup_data;
(21,{(1,Rajiv,Reddy,21,9848022337,Hyderabad),(3,Rajesh,Khanna,21,9848022339,Hyderabad),(4,Preethi,Agarwal,21,9848022330,Punei)},{})
(22,{(2,siddarth,Battacharya,22,9848022338,Kolkata)},{(1,Robin,22,newyork),(6,Maggy,22,Chennai)})
(23,{(5,Trupthi,Mohanthy,23,9848022336,Chennai),(6,Archana,Mishra,23,9848022335,Chennai)},{(2,BO,23,Kolkata),(3,Maya,23,Tokyo),(5,David,23,Bhuwaneshwar)})
(24,{(7,Komal,Nayak,24,9848022334,trivendram),(8,Bharathi,Nambiayar,24,9848022333,trivendram)},{})
(25,{},{(4,Sara,25,London)})

ORDER BY

The ORDER BY operator sorts the contents of a relation on one or more fields.
grunt> order_by_data = ORDER student BY age DESC;
grunt> DUMP order_by_data;
(7,Komal,Nayak,24,9848022334,trivendram)
(8,Bharathi,Nambiayar,24,9848022333,trivendram)
(5,Trupthi,Mohanthy,23,9848022336,Chennai)
(6,Archana,Mishra,23,9848022335,Chennai)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(3,Rajesh,Khanna,21,9848022339,Hyderabad)
(4,Preethi,Agarwal,21,9848022330,Punei)

Sorting on more than one field (here $3 is age and $1 is firstname, both ascending):
grunt> order_by_data = ORDER student BY $3, $1 ASC;
grunt> DUMP order_by_data;
(4,Preethi,Agarwal,21,9848022330,Punei)
(3,Rajesh,Khanna,21,9848022339,Hyderabad)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,trivendram)
(7,Komal,Nayak,24,9848022334,trivendram)

LIMIT

Use the LIMIT operator to limit the number of output tuples. If the specified number of output tuples equals or exceeds the number of tuples in the relation, the output includes all tuples in the relation. A particular set of tuples can be requested by using the ORDER operator followed by LIMIT.

grunt> X = LIMIT student 3;
grunt> DUMP X;
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(3,Rajesh,Khanna,21,9848022339,Hyderabad)

DISTINCT

Use the DISTINCT operator to remove duplicate tuples from a relation. DISTINCT does not preserve the original order of the contents, and you cannot use DISTINCT on a subset of fields.

grunt> A = LOAD 'data' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)

grunt> X = DISTINCT A;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)

FILTER

FILTER selects tuples from a relation based on a condition. Use the FILTER operator to work with tuples or rows of data. FILTER is commonly used to select the data that you want or, conversely, to filter out (remove) the data you don't want.

grunt> X = FILTER A BY a3 == 3;
grunt> DUMP X;
(1,2,3)
(4,3,3)
grunt> X = FILTER A BY (a1 == 8) OR (NOT (a2+a3 > a1));
grunt> DUMP X;
(8,3,4)

CROSS

Use the CROSS operator to compute the cross product (Cartesian product) of two or more relations.

grunt> A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
grunt> B = LOAD 'data2' AS (b1:int,b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)

In this example the cross product of relations A and B is computed:
grunt> X = CROSS A, B;
grunt> DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)

UNION

Use the UNION operator to merge the contents of two or more relations. The UNION operator:
• Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
• Does not ensure (as databases do) that all tuples belong to the same schema or that they have the same number of fields.
• Does not eliminate duplicate tuples.

grunt> A = LOAD 'data' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
grunt> B = LOAD 'data' AS (b1:int,b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)
grunt> X = UNION A, B;
grunt> DUMP X;
(output shown in one possible order, since UNION does not preserve order)
(1,2,3)
(4,2,1)
(2,4)
(8,9)
(1,3)

SPLIT

Use the SPLIT operator to partition the contents of a relation into two or more relations based on expressions. Depending on the conditions stated in the expressions:
• A tuple may be assigned to more than one relation.
• A tuple may not be assigned to any relation.

Syntax: SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ...];

In this example relation A is split into three relations, X, Y, and Z:
grunt> A = LOAD 'data' AS (f1:int,f2:int,f3:int);
grunt> DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
grunt> SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
grunt> DUMP X;
(1,2,3)
(4,5,6)
grunt> DUMP Y;
(4,5,6)
grunt> DUMP Z;
(1,2,3)
(7,8,9)

FLATTEN

FLATTEN eliminates a level of nesting by un-nesting tuples as well as bags; it converts inner bags into tuples. For example, consider a tuple of the form (a, (b, c)). The expression GENERATE $0, FLATTEN($1) will cause that tuple to become (a, b, c).

grunt> A = LOAD 'piginput/in' USING PigStorage(' ') AS (v1:int,v2:int,v3:int);
grunt> Result = GROUP A BY v1;
grunt> DUMP Result;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
grunt> X = FOREACH Result GENERATE group, FLATTEN(A);
grunt> DUMP X;
(1,1,2,3)
(4,4,2,1)
(4,4,3,3)
(7,7,2,5)
(8,8,3,4)
(8,8,4,3)

Another FLATTEN example (obtaining the group key and one particular field):
grunt> X = FOREACH Result GENERATE group, FLATTEN(A.v3);
grunt> DUMP X;
(1,3)
(4,1)
(4,3)
(7,5)
(8,4)
(8,3)

Another FLATTEN example (obtaining the group key and two fields):
grunt> X = FOREACH Result GENERATE group, FLATTEN(A.(v1,v3));
grunt> DUMP X;
(1,1,3)
(4,4,1)
(4,4,3)
(7,7,5)
(8,8,4)
(8,8,3)

FOREACH

Use the FOREACH ... GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation).

grunt> A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> X = FOREACH A GENERATE *;
grunt> DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/in piginput/in
~/hadoop-1.2.1$ pig -x mapreduce
grunt> A = LOAD 'piginput/in' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

grunt> X = FOREACH A GENERATE a1, a2;
grunt> STORE X INTO 'output';

Two fields in relation A are summed to form relation X, and a schema is defined for the projected field:
grunt> X = FOREACH A GENERATE a1+a2 AS f1:int;
grunt> DESCRIBE X;
X: {f1: int}
grunt> STORE X INTO 'output';

Obtaining the sum of the fields in each group:
grunt> A = LOAD 'piginput/in' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);
grunt> w1 = GROUP A BY a1;
grunt> X = FOREACH w1 {
>> Y = FOREACH A GENERATE a1+a2+a3 AS f1:int;
>> GENERATE group, Y;
>> };
grunt> DUMP X;
(1,{(6)})
(4,{(7),(10)})
(7,{(14)})
(8,{(15),(15)})

Obtaining the sum of the fields in each group where the sum > 10:
grunt> X = FOREACH w1 {
>> Y = FOREACH A GENERATE a1+a2+a3 AS f1:int;
>> Z = FILTER Y BY (f1 > 10);
>> GENERATE group, Z;
>> };
grunt> DUMP X;
(1,{})
(4,{})
(7,{(14)})
(8,{(15),(15)})

Finding the maximum number of hits for each URL:
~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/pigUrlCount piginput/pigUrlCount
grunt> urldata = LOAD 'piginput/pigUrlCount' USING PigStorage('\t') AS (url:chararray,hits:int);
grunt> urlgroup = GROUP urldata BY url;
grunt> C = FOREACH urlgroup {
>> ord = ORDER urldata BY hits DESC;
>> top = LIMIT ord 1;
>> GENERATE FLATTEN(top);
>> };

grunt> DUMP C;
(http://url10.com,267)
(http://url11.com,361)
(http://url12.com,361)
(http://url13.com,324)
(http://url14.com,361)
(http://url15.com,324)
(http://url16.com,361)
(http://url17.com,361)
(http://url18.com,324)
(http://url19.com,361)
....

Using Java regular expressions for matching

~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/employcsv.csv piginput/employcsv
grunt> employdata = LOAD 'piginput/employcsv' USING PigStorage(',') AS (name:chararray,age:int,city:chararray,salary:int,country:chararray);
grunt> DUMP employdata;

Obtaining employees whose name starts with "G":
grunt> results = FILTER employdata BY (name matches 'G.*');
grunt> DUMP results;
(Geoffrey,43,Woodward,10,Estonia)
(Garth,38,Bloomington,5,Gabon)
(Gray,75,Maywood,9,Greece)
(Griffin,75,Monongahela,1,Tajikistan)
(Gil,78,East St. Louis,6,Italy)

Obtaining employees whose age is between 50 and 60:
grunt> results = FILTER employdata BY (age matches '[50-60]'); -- this fails with an error
NOTE: the operands of matches must be chararray. For a numeric range, use a comparison in FILTER instead:
grunt> results = FILTER employdata BY age > 50 AND age < 60;
grunt> DUMP results;
(Timon,52,North Tonawanda,3,Zambia)
(Harrison,56,Syracuse,5,Seychelles)
(Bruce,52,Meridian,9,Iceland)
(Keith,58,Hope,1,Andorra)
(Mason,59,Manchester,7,Costa Rica)
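If you do want to apply a regular expression to a numeric field, one option (a sketch, not from the original slides) is to cast the field to chararray first. Note that '5[0-9]' matches ages 50 through 59, which is slightly different from the comparison above:
grunt> results = FILTER employdata BY ((chararray)age matches '5[0-9]');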

JOIN

Use the JOIN operator to perform an inner join of two or more relations based on common field values. Inner joins ignore null keys. JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.

Syntax: alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column;

grunt> A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> B = LOAD 'data2' AS (b1:int,b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

In this example relations A and B are joined on their first fields:
grunt> X = JOIN A BY a1, B BY b1;
grunt> DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)

LEFT OUTER JOIN:
grunt> C = JOIN A BY $0 LEFT OUTER, B BY $0;
grunt> DUMP C;
(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)

RIGHT OUTER JOIN:
grunt> C = JOIN A BY $0 RIGHT OUTER, B BY $0;
grunt> DUMP C;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)

FULL OUTER JOIN:
grunt> C = JOIN A BY $0 FULL, B BY $0;
grunt> DUMP C;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)

ILLUSTRATE

The ILLUSTRATE operator shows you the step-by-step execution of a sequence of statements on a small sample of the data, so you can check a script's logic without running it over the full input.
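A usage sketch (output omitted here, since the sampled rows ILLUSTRATE prints vary from run to run):
grunt> ILLUSTRATE C;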

Writing and running Pig scripts in local mode from the command prompt

Start all daemons:
~/hadoop-1.2.1$ bin/start-all.sh
~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode leave
Write the Pig script and save it in the current working directory with a .pig extension, then execute it from the command prompt:
~/hadoop-1.2.1$ pig -x local urlcount.pig
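urlcount.pig itself appeared as an image in the original slides; a plausible reconstruction, consistent with the interactive URL-count session shown earlier (the STORE path 'urloutput' is an assumption):

-- urlcount.pig (reconstruction; the output path 'urloutput' is assumed)
urldata = LOAD 'piginput/pigUrlCount' USING PigStorage('\t') AS (url:chararray,hits:int);
urlgroup = GROUP urldata BY url;
C = FOREACH urlgroup {
    ord = ORDER urldata BY hits DESC;  -- sort each URL's tuples by hits, highest first
    top = LIMIT ord 1;                 -- keep only the top tuple per URL
    GENERATE FLATTEN(top);
};
STORE C INTO 'urloutput';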

We can view our results in the current working directory.

Using parameter substitution in Pig

Sometimes it is better to write reusable scripts that take varying input. Pig supports parameter substitution to let the user supply such information at runtime. For example, the following script displays a user-specified number of tuples from a log file:

-- displaylog.pig
log = LOAD '$input' AS (user,time,query);
lmt = LIMIT log $size;
DUMP lmt;

The parameters in this script are $input and $size. When you run the script with the pig command, you specify each parameter using -param name=value:
$pig -param input=info.log -param size=4 displaylog.pig

If you have to specify many parameters, it may be more convenient to put them in a file; for example, we can create Myparams.txt. The parameter file is passed to the pig command with the -param_file filename argument:
$pig -param_file Myparams.txt displaylog.pig
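The contents of Myparams.txt appeared as an image in the original slides; matching the earlier displaylog.pig example, it would look something like this (one name=value pair per line, with # starting a comment):

# Myparams.txt (reconstruction matching the earlier example)
input=info.log
size=4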

Writing and running Pig scripts in MapReduce mode from the command prompt

*Start all daemons:
~/hadoop-1.2.1$ bin/start-all.sh
~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode leave
*Place your input file in HDFS:
~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/pigUrlCount piginput/pigUrlCount
*Write the Pig script and save it in the current working directory with a .pig extension, then execute it from the command prompt:
~/hadoop-1.2.1$ pig -x mapreduce urlcount.pig
(or)
~/hadoop-1.2.1$ pig urlcount.pig

We can view our results in the HDFS user directory.
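For instance, assuming the script stored its output under 'urloutput' as in the reconstruction above (the part file names may vary by job type):
~/hadoop-1.2.1$ bin/hadoop dfs -ls urloutput
~/hadoop-1.2.1$ bin/hadoop dfs -cat 'urloutput/part*'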

Running Pig scripts from the Grunt shell

*Using the run command:
run [-param param_name=param_value] [-param_file file_name] script

Use the run command to run a Pig script that can interact with the Grunt shell:
The Grunt shell has access to aliases defined within the script.
The script has access to aliases defined externally via the Grunt shell.
All commands from the script are visible in the command history.
Issuing a run command has basically the same effect as typing the commands manually.
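For example, to run the URL-count script written earlier from within the shell:
grunt> run urlcount.pig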

Because the Grunt shell has access to aliases defined within the script, after the run we can DUMP the script's alias C directly from the shell:
grunt> DUMP C;
(http://url1.com,324)
(http://url10.com,267)
(http://url11.com,361)
(http://url12.com,361)
(http://url13.com,324)
(http://url14.com,361)
(http://url15.com,324)
(http://url16.com,361)
(http://url17.com,361)
(http://url18.com,324)
(http://url19.com,361)
(http://url2.com,267)
(http://url20.com,361)
(http://url3.com,269)
(http://url4.com,269)
(http://url5.com,269)
(http://url6.com,324)
(http://url7.com,324)
(http://url8.com,269)
(http://url9.com,287)

*Using the exec command:
exec [-param param_name=param_value] [-param_file file_name] script

Use the exec command to run a Pig script with no interaction between the script and the Grunt shell:
Aliases defined in the script are not available to the shell.
Aliases defined via the shell are not available to the script.
Unlike the run command, exec does not change the command history.

grunt> exec urlcount.pig

However, the files produced as the output of the script are visible; we can view our results in the HDFS user directory.

User Defined Functions in Pig

Sometimes a user's requirement is not covered by the built-in functions; in that case the user can write a custom user defined function (UDF). You can write a UDF using Pig's Java API. To create a UDF you write a Java class that extends the EvalFunc<T> abstract class. It has one abstract method, exec, which you must implement:

abstract public T exec(Tuple input) throws IOException

This method is called on each tuple in a relation, where each tuple is represented by a Tuple object and T is the Java type of the value the function returns.

Steps to create a Pig UDF for converting loaded data tuples to upper case:
Step 1: Consider the input file "names" (in the local file system or HDFS):
rajesh
sai
kavitha
peter
Ali
Step 2: Start the Hadoop daemons and Pig (in local/MapReduce mode).

Step 3: Create a simple UDF program in Eclipse. Add pig-0.15-core-h1.jar, hadoop-core-1.2.1.jar, and commons-logging-1.1.1.jar to your project.
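The UDF source appeared as an image in the original slides; a minimal sketch consistent with the output shown in Step 6 (the class name ToUpper is an assumption):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Converts its chararray argument to upper case. Class name is assumed.
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for null or empty input tuples, as the built-in functions do.
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        return ((String) input.get(0)).toUpperCase();
    }
}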

Step 4: Right click on project —> Export —> create Jar (pig_udf.jar)
Step 5: Write a Pig script and register the jar file with Pig (a reconstruction of name.pig follows below).
Step 6: Run the Pig script to get the output:
grunt> run name.pig
(RAJESH)
(SAI)
(KAVITHA)
(PETER)
(ALI)
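name.pig itself was shown as an image; a plausible version, assuming the ToUpper class sketched above sits in the default package:

-- name.pig (reconstruction)
REGISTER pig_udf.jar;
A = LOAD 'names' AS (name:chararray);
B = FOREACH A GENERATE ToUpper(name);
DUMP B;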

Embedded Pig Latin programs

Pig Latin statements can be executed within Java, Python, or JavaScript programs; such programs are called embedded Pig Latin programs. Pig doesn't support control-flow constructs such as if/else, while loops, and for loops: it natively supports data flow, but needs to be embedded within another language to provide control flow.

Procedure for creating embedded Pig Latin programs:
PigServer is a class that Java programs use to connect to Pig. Typically a program creates a PigServer instance, registers queries using registerQuery(), and retrieves results using store(). After doing so, the shutdown() method should be called to free any resources used by the current PigServer instance; not doing so could result in a memory leak.

*public void registerQuery(String query) throws IOException
Registers a query with the Pig runtime. The query is parsed and registered, but it is not executed until it is needed. Here query is a Pig Latin expression to be evaluated.
*public ExecJob store(String id, String filename) throws IOException
Executes a Pig Latin script up to and including the indicated alias and stores the resulting records in a file. Returns an ExecJob containing information about the job.
*public void shutdown()
Reclaims resources used by this instance of PigServer. This method deletes all temporary files generated by the current thread while executing Pig commands.

Procedure to create embedded Pig Latin programs

Step 1: Create a simple project with a Java program in Eclipse and add:
*the Pig libraries
*the Hadoop libraries (inside lib and outside lib)
*the path of the Hadoop configuration files to your project:
Right click on project -> Build Path -> Configure Build Path -> click on "Add Variable" -> Configure Variable -> click on New, then provide
Name: PATH
Path: /home/satish/hadoop-1.2.1/conf
Then click OK.
Step 2: Place your input file in your project workspace directory if ExecType = LOCAL, or in HDFS if ExecType = MAPREDUCE.

import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class WCCOUNT {
    public static void main(String[] args) throws IOException {
        PigServer pigServer = new PigServer(ExecType.LOCAL);
        pigServer.registerQuery("input1 = LOAD 'foo' as (line:chararray);");
        pigServer.registerQuery("words = foreach input1 generate FLATTEN(TOKENIZE(line)) as word;");
        pigServer.store("words", "1");
        pigServer.registerQuery("word_groups = group words by word;");
        pigServer.store("word_groups", "2");
        pigServer.registerQuery("word_count = foreach word_groups generate group, COUNT(words);");
        pigServer.store("word_count", "3");
        pigServer.registerQuery("ordered_word_count = order word_count by group desc;");
        pigServer.registerQuery("store ordered_word_count into 'wct_output';");
        pigServer.shutdown();
    }
}

We can view our output in the workspace directory.