Published by Josie Hopewell. Modified over 9 years ago.
1
Your Name
2
Recap
Advanced Built-In Functions
UDF
Conclusion
3
Pig Advance
4
A platform for analyzing large data sets
Runs in local mode or distributed mode
Scripting language: Pig Latin, which is not equivalent to SQL
5
Key types: field, tuple, and bag
Schema: a way to assign a name and a type to a value
Operators: useful built-in operators
    LOAD/STORE
    GROUP/COGROUP
    JOIN
    FILTER
    FOREACH
    (…)
Tools: DUMP and DESCRIBE
6
Loading Data
Working with Data
Storing Intermediate Results
Storing Final Results
Debugging Pig Latin

a = LOAD 'data' AS (age:int, name:chararray);
b = FILTER a BY (age > 75);
c = FOREACH b GENERATE *;
STORE c INTO 'population';
7
Pig Advance
8
Built-in functions do not need to be registered.
They do not need to be qualified when they are used.
Just use them as you need!
9
Eval          Math     String
AVG           ABS      INDEXOF
CONCAT        ACOS     LAST_INDEX_OF
COUNT         ASIN     LCFIRST
COUNT_STAR    CBRT     UCFIRST
DIFF          CEIL     LOWER
ISEMPTY       COS      UPPER
MAX           COSH     REPLACE
MIN           EXP      SUBSTRING
SIZE          FLOOR    TRIM
SUM           LOG      REGEX_EXTRACT
TOKENIZE      LOG10    REGEX_EXTRACT_ALL

For the complete reference, see the Apache Pig built-in functions documentation.
10
Name       Syntax                                                                          Description
TOTUPLE    TOTUPLE(expression [, expression...])                                           Converts one or more expressions to type tuple.
TOMAP      TOMAP(key-expression, value-expression [, key-expression, value-expression...]) Converts key/value expression pairs into a map.
TOBAG      TOBAG(expression [, expression...])                                             Converts one or more expressions to type bag.
TOP        TOP(topN,column,relation)                                                       Returns the top-n tuples from a bag of tuples.

For the complete reference, see the Apache Pig built-in functions documentation.
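To make the semantics of these four functions concrete, here is a rough Python model. It is not Pig code. Tuples are modeled as Python tuples, bags as lists of tuples, and maps as dicts. The helper names (`to_tuple`, `to_map`, `to_bag`, `top`) are hypothetical, and `top` assumes that `column` is a zero-based field index and that larger values rank first.

```python
import heapq


def to_tuple(*values):
    """Model of TOTUPLE: pack the given values into one tuple."""
    return tuple(values)


def to_map(*pairs):
    """Model of TOMAP: arguments alternate key, value, key, value, ..."""
    if len(pairs) % 2 != 0:
        raise ValueError("to_map expects an even number of arguments")
    return dict(zip(pairs[0::2], pairs[1::2]))


def to_bag(*values):
    """Model of TOBAG: each value becomes one tuple in the bag."""
    return [v if isinstance(v, tuple) else (v,) for v in values]


def top(n, column, bag):
    """Model of TOP: the n tuples with the largest value in `column`."""
    return heapq.nlargest(n, bag, key=lambda t: t[column])
```

For example, `top(2, 1, [("a", 3), ("b", 9), ("c", 5)])` returns `[("b", 9), ("c", 5)]`.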
11
COUNT computes the number of elements in a bag.
It requires a preceding GROUP ALL statement for global counts or a GROUP BY statement for group counts.
COUNT ignores nulls: a tuple is not counted if its first field is null.
To include null values in the count, use COUNT_STAR.
12
a = LOAD 'data' AS (f1:int, f2:int, f3:int);
b = GROUP a BY f1;
x1 = FOREACH b GENERATE COUNT(a);
x2 = FOREACH b GENERATE COUNT_STAR(a);

Input data:
1 2 3
8 3 4
7 2 5
8 4 3
1 1

DUMP x1;
DUMP x2;
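The difference between COUNT and COUNT_STAR only shows up when nulls are present. The following Python sketch models the two functions on grouped data. It is an illustration, not Pig code: the helper names are hypothetical, and the sample rows (including the last row with a null `f1`) are an assumption chosen to expose the difference, not the data from the slide.

```python
from collections import defaultdict


def group_by(rows, key_index):
    """Model of GROUP ... BY: map each key to the bag (list) of its tuples."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def count(bag):
    """Model of COUNT: tuples whose first field is null (None) are skipped."""
    return sum(1 for t in bag if t[0] is not None)


def count_star(bag):
    """Model of COUNT_STAR: every tuple is counted, nulls included."""
    return len(bag)


# Hypothetical rows; the last one has a null f1 (assumed for illustration).
rows = [(1, 2, 3), (8, 3, 4), (7, 2, 5), (8, 4, 3), (None, 1, 1)]
groups = group_by(rows, 0)
x1 = {key: count(bag) for key, bag in groups.items()}
x2 = {key: count_star(bag) for key, bag in groups.items()}
```

In this model the group with the null key gets a count of 0 from `count` and 1 from `count_star`; all other groups agree.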
13
SUM computes the sum of the numeric values in a single-column bag.
It requires a preceding GROUP ALL statement for global sums or a GROUP BY statement for group sums.

a = LOAD 'data' USING PigStorage(',') AS (owner:chararray, pet_type:chararray, pet_count:int);
b = GROUP a BY owner;
x = FOREACH b GENERATE group, SUM(a.pet_count);

Input data:
Alice,turtle,1
Alice,goldfish,5
Alice,cat,2
Bob,dog,2
Bob,cat,2

DUMP x;
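As a cross-check of what the grouped SUM should produce for the five input rows above, here is a small Python model of the same computation. It is a sketch, not Pig code; the function name `sum_by_owner` is hypothetical, and null counts are skipped on the assumption that SUM ignores nulls.

```python
from collections import defaultdict

# The rows from the slide: (owner, pet_type, pet_count).
rows = [
    ("Alice", "turtle", 1),
    ("Alice", "goldfish", 5),
    ("Alice", "cat", 2),
    ("Bob", "dog", 2),
    ("Bob", "cat", 2),
]


def sum_by_owner(rows):
    """Model of GROUP BY owner followed by SUM over the count column."""
    totals = defaultdict(int)
    for owner, _pet_type, pet_count in rows:
        if pet_count is not None:  # nulls do not contribute to the sum
            totals[owner] += pet_count
    return dict(totals)


totals = sum_by_owner(rows)
```

Here `totals` is `{"Alice": 8, "Bob": 4}`, which corresponds to the tuples `(Alice,8)` and `(Bob,4)`.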
14
PigStorage
TextLoader
JsonLoader/JsonStorage
(Others)
15
Pig Advance
16
UDF stands for "User Defined Function".
UDFs can currently be implemented in Java, Python, JavaScript, or Ruby. (The most extensive support is provided for Java.)
Types
    Eval Function
    Load/Store Function
Piggy Bank: check it before you write your own
    https://cwiki.apache.org/confluence/display/PIG/PiggyBank
17
Pig Types and Native Java Types

Pig Type     Java Class
bytearray    DataByteArray
chararray    String
int          Integer
long         Long
float        Float
double       Double
tuple        Tuple
bag          DataBag
map          Map
18
Compile pig.jar first
Register the UDF jar in your Pig script
Use the UDF by its full name (package + class name)
Example
19
EvalFunc<T>
    public abstract T exec(Tuple input) throws IOException
    public Schema outputSchema(Schema input)
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException
20
Extends EvalFunc
Example:
    Input:
        Chair belongs to Phoenix
        Pencial belongs to Vincent
    Output:
        chair, tcloud_Phoenix
        pencial, tcloud_Vincent
UDF
Pig script
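The slide's UDF source is not preserved, so the following is only a sketch of how such an eval function might look when written as a Python UDF, assuming the input lines read "Chair belongs to Phoenix". The function name `tag_owner` and the schema alias are hypothetical. Pig is expected to supply the `outputSchema` decorator when the file is registered with `USING jython`; the fallback definition exists only so that the file also runs under plain Python.

```python
# Fallback so the module can be run and tested outside Pig. Under Pig's
# Jython engine the decorator is assumed to be provided already.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("tagged:chararray")
def tag_owner(line):
    """Turn '<Item> belongs to <Owner>' into '<item>, tcloud_<Owner>'."""
    if line is None:
        return None
    parts = line.split(" belongs to ")
    if len(parts) != 2:
        return None  # input does not match the assumed pattern
    item, owner = parts
    return "%s, tcloud_%s" % (item.strip().lower(), owner.strip())
```

In a Pig script, a file like this (say, a hypothetical `udfs.py`) would be registered with something like `REGISTER 'udfs.py' USING jython AS myfuncs;` and called as `myfuncs.tag_owner(line)`.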
21
Extends EvalFunc
Example:
    Input:
        lamp#yellow
        desk#brown
        chair#green
        water#transparent
    Output:
        (lamp,yellow)
        (desk,brown)
        (chair,green)
        (water,transparent)
UDF
Pig script
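An eval function can also return a tuple. The sketch below shows one way the `lamp#yellow` to `(lamp,yellow)` conversion could be written as a Python UDF; it is not the slide's original code. The function name `split_pair`, the field names in the schema string, and the choice to return null for lines without `#` are all assumptions.

```python
# Fallback so the module can be run and tested outside Pig. Under Pig's
# Jython engine the decorator is assumed to be provided already.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            func.outputSchema = schema
            return func
        return wrap


@outputSchema("pair:tuple(item:chararray, color:chararray)")
def split_pair(line):
    """Split 'item#color' into the tuple (item, color)."""
    if line is None or "#" not in line:
        return None  # treat malformed input as null
    item, color = line.split("#", 1)
    return (item, color)
```

For example, `split_pair("lamp#yellow")` returns `("lamp", "yellow")`.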
22
Extends FilterFunc
Example:
    Input:
        Mary,John,Steve#Steve
        Tom#Stevet
    Output:
        Mary,John,Steve#Steve
UDF
Pig script
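A filter function returns a boolean that decides whether a record is kept. The slide does not state the rule, so the Python sketch below assumes one that fits the example: keep a line only when the name after `#` appears in the comma-separated list before it. It is a plain Python model of the predicate, not a Pig FilterFunc, and the name `contains_target` is hypothetical.

```python
def contains_target(line):
    """Return True when the name after '#' is in the list before '#'."""
    if line is None or "#" not in line:
        return False
    names, target = line.split("#", 1)
    candidates = [name.strip() for name in names.split(",")]
    return target.strip() in candidates


lines = ["Mary,John,Steve#Steve", "Tom#Stevet"]
kept = [line for line in lines if contains_target(line)]
```

With the two example lines, `kept` is `["Mary,John,Steve#Steve"]`.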
23
The base classes are LoadFunc and StoreFunc
They are aligned with Hadoop's InputFormat and OutputFormat
24
Extends LoadFunc
    getInputFormat
    prepareToRead
    setLocation
    getNext
Example
25
Schema
Error handling
    WrappedIOException (deprecated)
Function overloading
Reporting progress
    Protected data variable in class EvalFunc: reporter.progress();
26
Pig Latin + UDF = easy analysis of (big) data!