
1 LOAD, DUMP, DESCRIBE operators

2 Employ data
001 Robin 22 newyork
002 BO 23 Kolkata
003 Maya 23 Tokyo
004 Sara 25 London
005 David 23 Bhuwaneshwar
006 Maggy 22 Chennai

grunt> A = LOAD 'piginput/employ';
grunt> DUMP A;
(001,Robin,22,newyork)
(002,BO,23,Kolkata)
(003,Maya,23,Tokyo)
(004,Sara,25,London)
(005,David,23,Bhuwaneshwar)
(006,Maggy,22,Chennai)
grunt> DESCRIBE A;
Schema for A unknown.

3 Defining a schema and using a load function
Employ data
001:Robin:22:newyork
002:BO:23:Kolkata
003:Maya:23:Tokyo
004:Sara:25:London
005:David:23:Bhuwaneshwar
006:Maggy:22:Chennai

grunt> A = LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> DUMP A;
(1,Robin,22,newyork)
(2,BO,23,Kolkata)
(3,Maya,23,Tokyo)
(4,Sara,25,London)
(5,David,23,Bhuwaneshwar)
(6,Maggy,22,Chennai)
grunt> DESCRIBE A;
A: {id: int,name: chararray,age: int,city: bytearray}

If we leave out a field's type, the default type bytearray is used:
grunt> A = LOAD 'piginput/employ' USING PigStorage(':') AS (id,name:chararray,age:int,city);
grunt> DESCRIBE A;
A: {id: bytearray,name: chararray,age: int,city: bytearray}

4 GROUP and COGROUP operators

5 Grouping Employ by age
grunt> A = LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> B = GROUP A by age;
grunt> DUMP B;
(22,{(1,Robin,22,newyork),(6,Maggy,22,Chennai)})
(23,{(2,BO,23,Kolkata),(3,Maya,23,Tokyo),(5,David,23,Bhuwaneshwar)})
(25,{(4,Sara,25,London)})
grunt> DESCRIBE B;
B: {group: int,A: {(id: int,name: chararray,age: int,city: bytearray)}}

Grouping Employ by ALL
grunt> B = GROUP A ALL;
grunt> DUMP B;
(all,{(1,Robin,22,newyork),(2,BO,23,Kolkata),(3,Maya,23,Tokyo),(4,Sara,25,London),(5,David,23,Bhuwaneshwar),(6,Maggy,22,Chennai)})
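The GROUP semantics above can be sketched in plain Python (an illustrative analogy, not Pig itself): every distinct key produces exactly one output tuple of the form (group, bag-of-matching-input-tuples).

```python
# Illustrative Python analogy of Pig's GROUP operator (not Pig itself).
# Employ tuples from the slide: (id, name, age, city).
employ = [
    (1, "Robin", 22, "newyork"), (2, "BO", 23, "Kolkata"),
    (3, "Maya", 23, "Tokyo"), (4, "Sara", 25, "London"),
    (5, "David", 23, "Bhuwaneshwar"), (6, "Maggy", 22, "Chennai"),
]

def pig_group(relation, key):
    """One output tuple per key: (group, bag of matching input tuples)."""
    bags = {}
    for t in relation:
        bags.setdefault(key(t), []).append(t)
    return sorted(bags.items())

by_age = pig_group(employ, key=lambda t: t[2])   # GROUP A BY age
group_all = [("all", list(employ))]              # GROUP A ALL: one single bag
```

Note how GROUP never discards tuples; it only nests them into bags, one bag per key.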

6 Grouping by Multiple fields
Let us consider student data:
001,Rajiv,Reddy,21, ,Hyderabad
002,siddarth,Battacharya,22, ,Kolkata
003,Rajesh,Khanna,21, ,Hyderabad
004,Preethi,Agarwal,21, ,Punei
005,Trupthi,Mohanthy,23, ,Chennai
006,Archana,Mishra,23, ,Chennai
007,Komal,Nayak,24, ,trivendram
008,Bharathi,Nambiayar,24, ,trivendram

grunt> A = LOAD 'piginput/student' USING PigStorage(',') AS (id:int,firstname:chararray,lastname:chararray,age:int,phno:long,city:chararray);
grunt> g1 = GROUP A by (age,city);
grunt> DUMP g1;
((21,Punei),{(4,Preethi,Agarwal,21, ,Punei)})
((21,Hyderabad),{(1,Rajiv,Reddy,21, ,Hyderabad),(3,Rajesh,Khanna,21, ,Hyderabad)})
((22,Kolkata),{(2,siddarth,Battacharya,22, ,Kolkata)})
((23,Chennai),{(5,Trupthi,Mohanthy,23, ,Chennai),(6,Archana,Mishra,23, ,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24, ,trivendram),(8,Bharathi,Nambiayar,24, ,trivendram)})

7 Grouping by Expression
grunt> A = LOAD 'data' as (f1:chararray, f2:int, f3:int);
grunt> DUMP A;
(r1,1,2)
(r2,2,1)
(r3,2,8)
(r4,4,4)

In this example the tuples are grouped using an expression, f2*f3.
grunt> X = GROUP A BY f2*f3;
grunt> DUMP X;
(2,{(r1,1,2),(r2,2,1)})
(16,{(r3,2,8),(r4,4,4)})

The COGROUP operator works more or less the same way as the GROUP operator. The only difference is that GROUP is normally used with one relation, while COGROUP is used in statements involving two or more relations.

8 grunt> employ = LOAD 'piginput/employ' USING PigStorage(':') AS (id:int,name:chararray,age:int,city);
grunt> student = LOAD 'piginput/student' USING PigStorage(',') AS (id:int,firstname:chararray,lastname:chararray,age:int,phno:long,city:chararray);
grunt> cogroup_data = COGROUP student by age, employ by age;
grunt> DUMP cogroup_data;
(21,{(1,Rajiv,Reddy,21, ,Hyderabad),(3,Rajesh,Khanna,21, ,Hyderabad),(4,Preethi,Agarwal,21, ,Punei)},{})
(22,{(2,siddarth,Battacharya,22, ,Kolkata)},{(1,Robin,22,newyork),(6,Maggy,22,Chennai)})
(23,{(5,Trupthi,Mohanthy,23, ,Chennai),(6,Archana,Mishra,23, ,Chennai)},{(2,BO,23,Kolkata),(3,Maya,23,Tokyo),(5,David,23,Bhuwaneshwar)})
(24,{(7,Komal,Nayak,24, ,trivendram),(8,Bharathi,Nambiayar,24, ,trivendram)},{})
(25,{},{(4,Sara,25,London)})
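The COGROUP output above can be modelled in Python (an illustrative analogy, not Pig itself): for every key seen in either relation, emit one tuple holding the key plus one bag per input relation, where unmatched keys get an empty bag. The tuples below are trimmed to (id, name, age) for brevity.

```python
# Illustrative Python analogy of COGROUP (not Pig itself).
# Tuples trimmed to (id, name, age) from the slide's data.
students = [
    (1, "Rajiv", 21), (2, "siddarth", 22), (3, "Rajesh", 21), (4, "Preethi", 21),
    (5, "Trupthi", 23), (6, "Archana", 23), (7, "Komal", 24), (8, "Bharathi", 24),
]
employees = [(1, "Robin", 22), (2, "BO", 23), (3, "Maya", 23),
             (4, "Sara", 25), (5, "David", 23), (6, "Maggy", 22)]

def pig_cogroup(r1, k1, r2, k2):
    """(key, bag-from-first-relation, bag-from-second-relation) per key."""
    keys = sorted({k1(t) for t in r1} | {k2(t) for t in r2})
    return [(k,
             [t for t in r1 if k1(t) == k],
             [t for t in r2 if k2(t) == k]) for k in keys]

cg = pig_cogroup(students, lambda t: t[2], employees, lambda t: t[2])
```

As on the slide, age 21 has an empty employ bag and age 25 has an empty student bag.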

9 The ORDER BY operator is used to display the contents of a relation in sorted order based on one or more fields.
grunt> order_by_data = ORDER student BY age DESC;
grunt> DUMP order_by_data;
(7,Komal,Nayak,24, ,trivendram)
(8,Bharathi,Nambiayar,24, ,trivendram)
(5,Trupthi,Mohanthy,23, ,Chennai)
(6,Archana,Mishra,23, ,Chennai)
(2,siddarth,Battacharya,22, ,Kolkata)
(1,Rajiv,Reddy,21, ,Hyderabad)
(3,Rajesh,Khanna,21, ,Hyderabad)
(4,Preethi,Agarwal,21, ,Punei)

Sorting based on more than one field
grunt> order_by_data = ORDER student BY $3,$1 ASC;
grunt> DUMP order_by_data;
(1,Rajiv,Reddy,21, ,Hyderabad)
(3,Rajesh,Khanna,21, ,Hyderabad)
(4,Preethi,Agarwal,21, ,Punei)
(2,siddarth,Battacharya,22, ,Kolkata)
(5,Trupthi,Mohanthy,23, ,Chennai)
(6,Archana,Mishra,23, ,Chennai)
(7,Komal,Nayak,24, ,trivendram)
(8,Bharathi,Nambiayar,24, ,trivendram)
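Multi-field sorting maps directly onto a composite sort key. A minimal Python analogy (not Pig itself), on tuples trimmed to (id, firstname, age); note that with $1 = firstname, ties on age sort alphabetically by first name:

```python
# Illustrative Python analogy of ORDER BY (not Pig itself).
students = [  # (id, firstname, age): a trimmed version of the slide's data
    (1, "Rajiv", 21), (2, "siddarth", 22), (3, "Rajesh", 21), (4, "Preethi", 21),
    (5, "Trupthi", 23), (6, "Archana", 23), (7, "Komal", 24), (8, "Bharathi", 24),
]

# ORDER student BY age DESC
by_age_desc = sorted(students, key=lambda t: t[2], reverse=True)

# ORDER student BY $3, $1 ASC -- composite key: age first, then firstname
by_age_then_name = sorted(students, key=lambda t: (t[2], t[1]))
```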

10 LIMIT
Use the LIMIT operator to limit the number of output tuples. If the specified number of output tuples is equal to or exceeds the number of tuples in the relation, the output will include all tuples in the relation. A particular set of tuples can be requested using the ORDER operator followed by LIMIT.
grunt> X = LIMIT student 3;
grunt> DUMP X;
(1,Rajiv,Reddy,21, ,Hyderabad)
(2,siddarth,Battacharya,22, ,Kolkata)
(3,Rajesh,Khanna,21, ,Hyderabad)

DISTINCT
Use the DISTINCT operator to remove duplicate tuples in a relation. DISTINCT does not preserve the original order of the contents. You cannot use DISTINCT on a subset of fields.
grunt> A = LOAD 'data' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)

11 grunt> X = DISTINCT A;
grunt> DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)

FILTER
Selects tuples from a relation based on some condition. Use the FILTER operator to work with tuples or rows of data. FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don't want.
grunt> X = FILTER A BY a3 == 3;
grunt> DUMP X;
(1,2,3)
(4,3,3)
grunt> X = FILTER A BY (a1 == 8) OR (NOT (a2+a3 > a1));
grunt> DUMP X;
(8,3,4)
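The DISTINCT and FILTER examples above are easy to check in Python (an illustrative analogy, not Pig itself): DISTINCT is set semantics over whole tuples, and each FILTER is a predicate over one tuple at a time.

```python
# Illustrative Python analogy of DISTINCT and FILTER (not Pig itself).
A = [(8, 3, 4), (1, 2, 3), (4, 3, 3)]

distinct = sorted(set(A))          # DISTINCT does not preserve input order
f1 = [t for t in A if t[2] == 3]   # FILTER A BY a3 == 3
# FILTER A BY (a1 == 8) OR (NOT (a2 + a3 > a1))
f2 = [t for t in A if t[0] == 8 or not (t[1] + t[2] > t[0])]
```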

12 CROSS
Use the CROSS operator to compute the cross product (Cartesian product) of two or more relations.
grunt> A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
grunt> B = LOAD 'data2' AS (b1:int,b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)

In this example the cross product of relations A and B is computed.
grunt> X = CROSS A, B;
grunt> DUMP X;
(1,2,3,2,4)
(1,2,3,8,9)
(1,2,3,1,3)
(4,2,1,2,4)
(4,2,1,8,9)
(4,2,1,1,3)
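CROSS is simply every pairing of tuples from the two relations, concatenated. A one-line Python analogy (not Pig itself):

```python
# Illustrative Python analogy of CROSS (not Pig itself): the Cartesian
# product of A and B, with each pair of tuples concatenated.
A = [(1, 2, 3), (4, 2, 1)]
B = [(2, 4), (8, 9), (1, 3)]

X = [a + b for a in A for b in B]   # 2 x 3 = 6 output tuples
```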

13 UNION
Use the UNION operator to merge the contents of two or more relations. The UNION operator:
• Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
• Does not ensure (as databases do) that all tuples belong to the same schema or that they have the same number of fields.
• Does not eliminate duplicate tuples.
grunt> A = LOAD 'data' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
grunt> B = LOAD 'data' AS (b1:int,b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)
grunt> X = UNION A, B;
grunt> DUMP X;
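The three bullet points above amount to bag concatenation: no reordering guarantee, no schema check, no deduplication. A Python analogy (not Pig itself):

```python
# Illustrative Python analogy of UNION (not Pig itself): plain bag
# concatenation -- tuples of different widths coexist, duplicates are kept.
A = [(1, 2, 3), (4, 2, 1)]
B = [(2, 4), (8, 9), (1, 3)]

X = A + B
```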

14 SPLIT
Use the SPLIT operator to partition the contents of a relation into two or more relations based on some expression. Depending on the conditions stated in the expression:
• A tuple may be assigned to more than one relation.
• A tuple may not be assigned to any relation.
SPLIT alias INTO alias IF expression, alias IF expression [, alias IF expression ...];

In this example relation A is split into three relations, X, Y, and Z.
grunt> A = LOAD 'data' AS (f1:int,f2:int,f3:int);
grunt> DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
grunt> SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
grunt> DUMP X;
grunt> DUMP Y;
grunt> DUMP Z;
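Because each SPLIT condition is evaluated independently, a tuple can land in several outputs or in none. A Python analogy of the example above (not Pig itself):

```python
# Illustrative Python analogy of SPLIT (not Pig itself): each condition is
# an independent filter, so outputs can overlap or miss tuples entirely.
A = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

X = [t for t in A if t[0] < 7]               # f1 < 7
Y = [t for t in A if t[1] == 5]              # f2 == 5
Z = [t for t in A if t[2] < 6 or t[2] > 6]   # f3 < 6 OR f3 > 6
```

Here (4,5,6) lands in both X and Y but in neither condition of Z, which demonstrates both bullet points.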

15 FLATTEN
FLATTEN eliminates a level of nesting by un-nesting tuples as well as bags: it converts inner bags into tuples. For example, consider a tuple of the form (a, (b, c)). The expression GENERATE $0, FLATTEN($1) will cause that tuple to become (a, b, c).
grunt> A = LOAD 'piginput/in' USING PigStorage(' ') AS (v1:int,v2:int,v3:int);
grunt> Result = GROUP A BY v1;
grunt> DUMP Result;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
grunt> X = FOREACH Result GENERATE group, FLATTEN(A);
grunt> DUMP X;
(1,1,2,3)
(4,4,2,1)
(4,4,3,3)
(7,7,2,5)
(8,8,3,4)
(8,8,4,3)

16 Another FLATTEN example (obtaining the group number and one particular value):
grunt> X = FOREACH Result GENERATE group, FLATTEN(A.v3);
grunt> DUMP X;
(1,3)
(4,1)
(4,3)
(7,5)
(8,4)
(8,3)

Another FLATTEN example (obtaining the group number and any two values):
grunt> X = FOREACH Result GENERATE group, FLATTEN(A.(v1,v3));
grunt> DUMP X;
(1,1,3)
(4,4,1)
(4,4,3)
(7,7,5)
(8,8,4)
(8,8,3)
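Both FLATTEN variants reduce to the same pattern: one output row per element of the inner bag, optionally projecting fields out of the bag first. A Python analogy (not Pig itself):

```python
# Illustrative Python analogy of FLATTEN in a FOREACH (not Pig itself):
# un-nesting a bag produces one output row per bag element.
result = [  # (group, bag) pairs, as produced by GROUP A BY v1 on the slide
    (1, [(1, 2, 3)]),
    (4, [(4, 2, 1), (4, 3, 3)]),
    (7, [(7, 2, 5)]),
    (8, [(8, 3, 4), (8, 4, 3)]),
]

# FOREACH Result GENERATE group, FLATTEN(A);
x_full = [(g,) + t for g, bag in result for t in bag]

# FOREACH Result GENERATE group, FLATTEN(A.v3);  -- project v3 out of the bag
x_v3 = [(g, t[2]) for g, bag in result for t in bag]
```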

17 FOREACH
Use the FOREACH ... GENERATE operation to work with columns of data (if you want to work with tuples or rows of data, use the FILTER operation).
grunt> A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> X = FOREACH A GENERATE *;
grunt> DUMP X;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

18 :~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/in piginput/in
:~/hadoop-1.2.1$ pig -x mapreduce
grunt> A = LOAD 'piginput/in' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)

19 grunt> X = FOREACH A GENERATE a1, a2;
grunt> STORE X INTO 'output';

Two fields in relation A are summed to form relation X. A schema is defined for the projected field.
grunt> X = FOREACH A GENERATE a1+a2 AS f1:int;
grunt> DESCRIBE X;
X: {f1: int}
grunt> STORE X INTO 'output';

20 Obtaining the sum of group fields
grunt> A = LOAD 'piginput/in' USING PigStorage(' ') AS (a1:int,a2:int,a3:int);
grunt> w1 = GROUP A by a1;
grunt> X = FOREACH w1 {
>>   Y = FOREACH A GENERATE a1+a2+a3 AS f1:int;
>>   GENERATE group, Y;
>> };
grunt> DUMP X;
(1,{(6)})
(4,{(7),(10)})
(7,{(14)})
(8,{(15),(15)})

Obtaining the sum of group fields whose sum > 10
grunt> X = FOREACH w1 {
>>   Y = FOREACH A GENERATE a1+a2+a3 AS f1:int;
>>   Z = FILTER Y BY (f1 > 10);
>>   GENERATE group, Z;
>> };
grunt> DUMP X;
(1,{})
(4,{})
(7,{(14)})
(8,{(15),(15)})
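The nested FOREACH above is a two-stage pipeline per group: compute the row sums inside each bag, then filter the sums. The same logic in Python (an illustrative analogy, not Pig itself):

```python
# Illustrative Python analogy of the nested FOREACH above (not Pig itself):
# per group, compute a1+a2+a3 for each row, then keep only sums > 10.
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]

groups = {}
for t in A:
    groups.setdefault(t[0], []).append(t)   # GROUP A BY a1

sums = {g: [sum(t) for t in bag] for g, bag in groups.items()}
big_sums = {g: [s for s in ss if s > 10] for g, ss in sums.items()}
```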

21 Finding the maximum number of hits for each URL
~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/pigUrlCount piginput/pigUrlCount
grunt> urldata = LOAD 'piginput/pigUrlCount' USING PigStorage('\t') AS (url:chararray,hits:int);
grunt> urlgroup = GROUP urldata by url;
grunt> C = FOREACH urlgroup {
>>   ord = ORDER urldata BY hits DESC;
>>   top = LIMIT ord 1;
>>   GENERATE FLATTEN(top);
>> };
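The ORDER ... DESC followed by LIMIT 1 pattern is a per-group top-1. A Python analogy (not Pig itself); the URLs and hit counts below are hypothetical sample values, since the slide's actual data did not survive transcription:

```python
# Illustrative Python analogy of per-group ORDER ... DESC / LIMIT 1
# (not Pig itself). The (url, hits) values are hypothetical samples.
urldata = [("a.com", 3), ("b.com", 7), ("a.com", 9), ("b.com", 2), ("c.com", 5)]

groups = {}
for url, hits in urldata:
    groups.setdefault(url, []).append(hits)   # GROUP urldata BY url

# per group: sort hit counts descending, keep only the top one
top_hits = {url: sorted(hs, reverse=True)[0] for url, hs in groups.items()}
```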

22 grunt> DUMP C;

23 Using Java regular expressions for matching
~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/employcsv.csv piginput/employcsv
grunt> employdata = LOAD 'piginput/employcsv' USING PigStorage(',') AS (name:chararray,age:int,city:chararray,salary:int,country:chararray);
grunt> DUMP employdata;

24 Obtaining employdata whose name starts with "G"
grunt> results = FILTER employdata BY (name matches 'G.*');
grunt> DUMP results;
(Geoffrey,43,Woodward,10,Estonia)
(Garth,38,Bloomington,5,Gabon)
(Gray,75,Maywood,9,Greece)
(Griffin,75,Monongahela,1,Tajikistan)
(Gil,78,East St. Louis,6,Italy)

Obtaining employdata whose age is between 50 and 60
grunt> results = FILTER employdata BY (age matches '[50-60]'); -- this raises an error
NOTE: the operands of `matches` can be chararray only. Use a comparison FILTER instead:
grunt> results = FILTER employdata BY age > 50 AND age < 60;
grunt> DUMP results;
(Timon,52,North Tonawanda,3,Zambia)
(Harrison,56,Syracuse,5,Seychelles)
(Bruce,52,Meridian,9,Iceland)
(Keith,58,Hope,1,Andorra)
(Mason,59,Manchester,7,Costa Rica)
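Pig's `matches` is a full-string regular-expression match, which corresponds to `re.fullmatch` in Python rather than `re.search`. A Python analogy of both filters (not Pig itself), on tuples trimmed to (name, age):

```python
import re

# Illustrative Python analogy of the two filters above (not Pig itself).
# Tuples trimmed to (name, age) from the slide's data.
employdata = [("Geoffrey", 43), ("Garth", 38), ("Gray", 75), ("Griffin", 75),
              ("Gil", 78), ("Timon", 52), ("Harrison", 56), ("Bruce", 52),
              ("Keith", 58), ("Mason", 59)]

# `matches` is a full-string match, like re.fullmatch
starts_with_g = [t for t in employdata if re.fullmatch("G.*", t[0])]

# Age ranges need comparisons, not a regex (the field is an int, not chararray)
in_fifties = [t for t in employdata if 50 < t[1] < 60]
```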

25 JOIN
Use the JOIN operator to perform an inner join of two or more relations based on common field values. Inner joins ignore null keys. JOIN creates a flat set of output records while COGROUP creates a nested set of output records.
alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column;
grunt> A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> B = LOAD 'data2' AS (b1:int,b2:int);
grunt> DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)

In this example relations A and B are joined by their first fields.
grunt> X = JOIN A BY a1, B BY b1;
grunt> DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)

26 LEFT OUTER JOIN
grunt> C = JOIN A by $0 LEFT OUTER, B BY $0;
(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)

RIGHT OUTER JOIN
grunt> C = JOIN A by $0 RIGHT OUTER, B BY $0;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)

27 FULL OUTER JOIN
grunt> C = JOIN A by $0 FULL, B BY $0;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)

ILLUSTRATE
The ILLUSTRATE operator gives you the step-by-step execution of a sequence of statements.
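The join outputs above can be reproduced with a nested-loop sketch in Python (an illustrative analogy, not how Pig executes joins): an inner join keeps only matching pairs, while a left outer join additionally pads unmatched left rows with nulls, which is where the (7,2,5,,) row comes from.

```python
# Illustrative Python analogy of inner and left outer joins (not Pig itself).
A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]

def inner_join(r1, r2):
    """Keep only pairs whose first fields match; unmatched rows disappear."""
    return [a + b for a in r1 for b in r2 if a[0] == b[0]]

def left_outer_join(r1, r2):
    """Like inner join, but unmatched left rows are padded with nulls."""
    out = []
    for a in r1:
        matches = [b for b in r2 if a[0] == b[0]]
        if matches:
            out.extend(a + b for b in matches)
        else:
            out.append(a + (None, None))   # Pig prints nulls as empty fields
    return out

inner = inner_join(A, B)
left = left_outer_join(A, B)
```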

28 Writing and running Pig scripts in local mode from the command prompt
Start all daemons:
~/hadoop-1.2.1$ bin/start-all.sh
~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode leave
Write your Pig script and save it in the current working directory with a .pig extension, then execute it from the command prompt:
~/hadoop-1.2.1$ pig -x local urlcount.pig

29 We can view our results in the current directory itself

30 Using parameter substitution in Pig
Sometimes it is better to write reusable scripts that take varying input. Pig supports parameter substitution to allow the user to specify such information at runtime. For example, the following script displays a user-specified number of tuples from a log file:
-- displaylog.pig
log = LOAD '$input' AS (user,time,query);
lmt = LIMIT log $size;
DUMP lmt;
The parameters in this script are $input and $size. When you run this script with the pig command, you specify parameters using -param name=value.
Example:
$ pig -param input=info.log -param size=4 displaylog.pig
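Conceptually, Pig resolves $name placeholders in the script text before execution. Python's string.Template happens to use the same $name syntax, so the substitution step can be modelled directly (an illustrative analogy, not Pig's actual preprocessor):

```python
from string import Template

# Illustrative analogy of Pig's $name parameter substitution (not Pig itself):
# string.Template uses the same $name placeholder syntax.
script = Template(
    "log = LOAD '$input' AS (user,time,query);\n"
    "lmt = LIMIT log $size;\n"
    "DUMP lmt;"
)

# equivalent of: pig -param input=info.log -param size=4 displaylog.pig
resolved = script.substitute(input="info.log", size=4)
```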

31 If you have to specify many parameters, it may be more convenient to put them in a file, one name=value pair per line (for example, a Myparams.txt containing input=info.log and size=4). The parameter file is passed to the pig command with the -param_file filename argument.
Example:
$ pig -param_file Myparams.txt displaylog.pig

32 Writing and running Pig scripts in MapReduce mode from the command prompt
* Start all daemons:
~/hadoop-1.2.1$ bin/start-all.sh
~/hadoop-1.2.1$ bin/hadoop dfsadmin -safemode leave
* Place your input file in HDFS:
~/hadoop-1.2.1$ bin/hadoop dfs -put piginput/pigUrlCount piginput/pigUrlCount
* Write your Pig script and save it in the current working directory with a .pig extension, then execute it from the command prompt:
~/hadoop-1.2.1$ pig -x mapreduce urlcount.pig
(or)
~/hadoop-1.2.1$ pig urlcount.pig

33 We can view our results in the HDFS user directory

34 Running Pig scripts from the Grunt shell
* Using the run command:
run [-param param_name = param_value] [-param_file file_name] script
Use the run command to run a Pig script that can interact with the Grunt shell.
The Grunt shell has access to aliases defined within the script.
The script has access to aliases defined externally via the Grunt shell.
All commands from the script are visible in the command history.
Issuing a run command has basically the same effect as typing the commands manually.

35 grunt> DUMP C;

36 * Using the exec command:
exec [-param param_name = param_value] [-param_file file_name] script
Use the exec command to run a Pig script with no interaction between the script and the Grunt shell.
Aliases defined in the script are not available to the shell.
Aliases defined via the shell are not available to the script.
Unlike the run command, exec does not change the command history.
grunt> exec urlcount.pig
However, the files produced as the output of the script are visible. We can view our results in the HDFS user directory.

37 User Defined Functions in Pig
Sometimes the user's requirement is not covered by the built-in functions; in that case the user can write a custom function, called a UDF (user defined function). You can write a UDF using Pig's Java API. To create a UDF you make a Java class that extends the EvalFunc<T> abstract class. It has only one abstract method, exec, which you need to implement:
abstract public T exec(Tuple input) throws IOException
This method is called on each tuple in a relation, where each tuple is represented by a Tuple object. Here T can be any one of the Java classes.
Steps to create a Pig UDF for converting loaded data tuples into upper case
Step 1: Consider an input file "names" (in the local file system or HDFS):
rajesh
sai
kavitha
peter
Ali
Step 2: Start the Hadoop daemons and Pig (in local/MapReduce mode).

38 Step 3: Create a simple UDF program in your Eclipse project.
Import pig-0.15-core-h1.jar, the hadoop-core jar, and the commons-logging jar into your project.

39 Step 4: Right click on the project -> Export -> create a jar (pig_udf.jar).
Step 5: Write a Pig script and REGISTER the jar file with Pig.
Step 6: Run the Pig script to get the output:
grunt> run name.pig
(RAJESH)
(SAI)
(KAVITHA)
(PETER)
(ALI)


41 Embedded Pig Latin programs
Pig Latin statements can be executed within Java, Python, or JavaScript programs. Such programs are called embedded Pig Latin programs. Pig doesn't support control-flow statements such as if/else, while loops, and for loops. Pig natively supports data flow, but needs to be embedded within another language to provide control flow.
Procedure for creating embedded Pig Latin programs
PigServer is a class for Java programs to connect to Pig. Typically a program will create a PigServer instance. The programmer then registers queries using registerQuery() and retrieves results using store(). After doing so, the shutdown() method should be called to free any resources used by the current PigServer instance. Not doing so could result in a memory leak.
* public void registerQuery(String query) throws IOException
Registers a query with the Pig runtime. The query is parsed and registered, but it is not executed until it is needed. Here query is a Pig Latin expression to be evaluated.
* public ExecJob store(String id, String filename) throws IOException
Executes a Pig Latin script up to and including the indicated alias and stores the resulting records into a file. Returns an ExecJob containing information about the job.
* public void shutdown()
Reclaims resources used by this instance of PigServer. This method deletes all temporary files generated by the current thread while executing Pig commands.

42 Procedure to create embedded Pig Latin programs
Step 1: Create a simple project with a Java program in Eclipse and add:
* the Pig libraries
* the Hadoop libraries (inside lib and outside lib)
* the path of the Hadoop configuration files to your project:
Right click on the project -> Build Path -> Configure Build Path -> click on "Add Variable" -> Configure Variables -> click New, then provide
Name: PATH
Path: /home/satish/hadoop-1.2.1/conf
Then click OK.
Step 2: Place your input file in your project workspace directory if ExecType = LOCAL; place your input file in HDFS if ExecType = MAPREDUCE.

43 import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class WCCOUNT {
    public static void main(String[] args) throws IOException {
        PigServer pigServer = new PigServer(ExecType.LOCAL);
        pigServer.registerQuery("input1 = LOAD 'foo' as (line:chararray);");
        pigServer.registerQuery("words = foreach input1 generate FLATTEN(TOKENIZE(line)) as word;");
        pigServer.store("words", "1");
        pigServer.registerQuery("word_groups = group words by word;");
        pigServer.store("word_groups", "2");
        pigServer.registerQuery("word_count = foreach word_groups generate group, COUNT(words);");
        pigServer.store("word_count", "3");
        pigServer.registerQuery("ordered_word_count = order word_count by group desc;");
        pigServer.registerQuery("store ordered_word_count into 'wct_output';");
        pigServer.shutdown();
    }
}

44 We can view our output in the workspace.

