- Recap
- Advanced Built-In Functions
- UDF
- Conclusion
Pig Advance
Pig is a platform for analyzing large data sets.
- Local mode
- Distributed mode
- Scripting language (Pig Latin), which is not the same as SQL
- Key types: field, tuple, and bag
- Schema: a way to assign a name and a type to a value
- Operators: useful built-in operators
    LOAD/STORE, GROUP/COGROUP, JOIN, FILTER, FOREACH, (…)
- Tools: DUMP and DESCRIBE
- Loading Data
- Working with Data
- Storing Intermediate Results
- Storing Final Results
- Debugging Pig Latin

    a = LOAD 'data' AS (age:int, name:chararray);
    b = FILTER a BY (age > 75);
    c = FOREACH b GENERATE *;
    STORE c INTO 'population';
Built-in functions:
- Do not need to be registered
- Do not need to be qualified when they are used
- Just use them as you need!
Eval          Math     String
AVG           ABS      INDEXOF
CONCAT        ACOS     LAST_INDEX_OF
COUNT         ASIN     LCFIRST
COUNT_STAR    CBRT     UCFIRST
DIFF          CEIL     LOWER
IsEmpty       COS      UPPER
MAX           COSH     REPLACE
MIN           EXP      SUBSTRING
SIZE          FLOOR    TRIM
SUM           LOG      REGEX_EXTRACT
TOKENIZE      LOG10    REGEX_EXTRACT_ALL

For a complete reference, see the Apache Pig built-in functions documentation.
Name       Syntax                                                                               Description
TOTUPLE    TOTUPLE(expression [, expression...])                                                Converts one or more expressions to type tuple.
TOMAP      TOMAP(key-expression, value-expression [, key-expression, value-expression...])      Converts key/value expression pairs into a map.
TOBAG      TOBAG(expression [, expression...])                                                  Converts one or more expressions to type bag.
TOP        TOP(topN,column,relation)                                                            Returns the top-n tuples from a bag of tuples.

For a complete reference, see the Apache Pig built-in functions documentation.
COUNT computes the number of elements in a bag. It requires a preceding GROUP ALL statement for global counts or a GROUP BY statement for group counts. COUNT ignores nulls: a tuple in the bag is not counted if its first field is null. To include null values in the count, use COUNT_STAR.
    a = LOAD 'data' AS (f1:int, f2:int, f3:int);
    b = GROUP a BY f1;
    x1 = FOREACH b GENERATE COUNT(a);
    x2 = FOREACH b GENERATE COUNT_STAR(a);
    DUMP x1;
    DUMP x2;
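The difference between COUNT and COUNT_STAR can be illustrated outside Pig. The following is a minimal plain-Java sketch, not Pig's implementation: the class name `CountSketch`, the use of `Object[]` for a tuple, and the sample bag are all hypothetical, chosen only to show that one function skips tuples with a null first field and the other does not.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of the null handling of COUNT vs COUNT_STAR.
// This is a simplified model, not code from Pig itself.
final class CountSketch {

    // Mirrors COUNT: a tuple is skipped when its first field is null.
    static long count(List<Object[]> bag) {
        long n = 0;
        for (Object[] tuple : bag) {
            if (tuple.length > 0 && tuple[0] != null) {
                n++;
            }
        }
        return n;
    }

    // Mirrors COUNT_STAR: every tuple is counted, nulls included.
    static long countStar(List<Object[]> bag) {
        return bag.size();
    }

    public static void main(String[] args) {
        // Hypothetical bag of (f1, f2, f3) tuples; the second has a null first field.
        List<Object[]> bag = new ArrayList<>();
        bag.add(new Object[] {1, 2, 3});
        bag.add(new Object[] {null, 5, 6});
        bag.add(new Object[] {7, null, 9});

        System.out.println("COUNT      = " + count(bag));
        System.out.println("COUNT_STAR = " + countStar(bag));
    }
}
```

Only a null in the first field affects the COUNT-style result here; the null in the second field of the third tuple does not.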
SUM computes the sum of the numeric values in a single-column bag. It requires a preceding GROUP ALL statement for global sums and a GROUP BY statement for group sums.

Input data:
    Alice,turtle,1
    Alice,goldfish,5
    Alice,cat,2
    Bob,dog,2
    Bob,cat,2

Script:
    a = LOAD 'data' USING PigStorage(',') AS (owner:chararray, pet_type:chararray, pet_count:int);
    b = GROUP a BY owner;
    x = FOREACH b GENERATE group, SUM(a.pet_count);
    DUMP x;
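To make the GROUP BY plus SUM arithmetic concrete, here is a small plain-Java sketch that performs the same grouping and summing on the five sample rows from the slide. It is only an illustration of the computation, not of how Pig executes it; the class name `SumSketch` and the method `sumByOwner` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java illustration of "group by owner, then sum the pet count".
final class SumSketch {

    // Each row is "owner,pet_type,count"; totals are keyed by owner
    // in first-seen order.
    static Map<String, Long> sumByOwner(String[] rows) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String row : rows) {
            String[] fields = row.split(",");
            long count = Long.parseLong(fields[2].trim());
            totals.merge(fields[0], count, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        String[] rows = {
            "Alice,turtle,1",
            "Alice,goldfish,5",
            "Alice,cat,2",
            "Bob,dog,2",
            "Bob,cat,2"
        };
        // Alice: 1 + 5 + 2 = 8, Bob: 2 + 2 = 4
        System.out.println(sumByOwner(rows));
    }
}
```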
Built-in load/store functions:
- PigStorage
- TextLoader
- JsonLoader/JsonStorage
- (Others)
UDF stands for "User Defined Function".
- UDFs can currently be implemented in Java, Python, JavaScript, or Ruby (the most extensive support is provided for Java).
- Types:
    Eval Function
    Load/Store Function
- Piggy Bank: check it before you write your own.
Pig Types and Native Java Types

Pig Type     Java Class
bytearray    DataByteArray
chararray    String
int          Integer
long         Long
float        Float
double       Double
tuple        Tuple
bag          DataBag
map          Map
1. Compile pig.jar first.
2. Register the UDF jar in your Pig script.
3. Use the UDF with its full name (package + class name).
EvalFunc<T>
    public abstract T exec(Tuple input) throws IOException
    public Schema outputSchema(Schema input)
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException
Eval function: extends EvalFunc

Example input:
    Chair belongs to Phoenix
    Pencial belongs to Vincent

Example output:
    chair, tcloud_Phoenix
    pencial, tcloud_Vincent

(UDF source and Pig script were shown on the slide.)
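The slide's UDF listing did not survive, so the following is only a guess at the string logic behind the sample input and output, written as plain Java so it runs without pig.jar. It assumes the input reads "<item> belongs to <owner>" (with or without spaces) and that the output lowercases the item and prefixes the owner with "tcloud_". The class name `OwnerTagSketch` and the method `transform` are hypothetical; in an actual UDF this logic would sit inside `exec(Tuple input)` of a class extending `EvalFunc<String>`.

```java
import java.util.Locale;

// Plain-Java sketch of the string transformation suggested by the sample data.
final class OwnerTagSketch {

    // Returns "<item lowercased>, tcloud_<owner>", or null for malformed input
    // (returning null rather than throwing is an assumption of this sketch).
    static String transform(String line) {
        if (line == null) {
            return null;
        }
        // Tolerates both "Chair belongs to Phoenix" and "ChairbelongstoPhoenix".
        String[] parts = line.trim().split("\\s*belongs\\s*to\\s*", 2);
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            return null;
        }
        return parts[0].toLowerCase(Locale.ROOT) + ", tcloud_" + parts[1];
    }

    public static void main(String[] args) {
        System.out.println(transform("Chair belongs to Phoenix"));
        System.out.println(transform("Pencial belongs to Vincent"));
    }
}
```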
Eval function: extends EvalFunc

Example input:
    lamp#yellow
    desk#brown
    chair#green
    water#transparent

Example output:
    (lamp,yellow)
    (desk,brown)
    (chair,green)
    (water,transparent)

(UDF source and Pig script were shown on the slide.)
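The sample data suggests an eval function that splits each line at "#" and returns the two halves as a tuple. Below is a minimal plain-Java sketch of that splitting step, assuming one "#" separator per line; `SplitPairSketch`, `toPair`, and `format` are hypothetical names, and a `List<String>` stands in for a Pig tuple. In a real UDF the result would be a `Tuple`, typically created with `TupleFactory.getInstance().newTuple(2)`.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch: split "key#value" into a two-element pair.
final class SplitPairSketch {

    // Returns [left, right] split at the first '#', or null if there is none.
    static List<String> toPair(String line) {
        if (line == null) {
            return null;
        }
        int sep = line.indexOf('#');
        if (sep < 0) {
            return null;
        }
        return Arrays.asList(line.substring(0, sep), line.substring(sep + 1));
    }

    // Renders the pair the way DUMP prints a tuple, e.g. (lamp,yellow).
    static String format(List<String> pair) {
        return "(" + String.join(",", pair) + ")";
    }

    public static void main(String[] args) {
        String[] lines = {"lamp#yellow", "desk#brown", "chair#green", "water#transparent"};
        for (String line : lines) {
            System.out.println(format(toPair(line)));
        }
    }
}
```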
Filter function: extends FilterFunc

Example input:
    Mary,John,Steve#Steve
    Tom#Stevet

Example output:
    Mary,John,Steve#Steve

(UDF source and Pig script were shown on the slide.)
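One reading of the sample data is that a row is kept when the name after "#" appears in the comma-separated list before it. That rule is an inference from the two sample rows, not something the slide states. The sketch below expresses it as a plain-Java predicate; `ContainsNameSketch` and `keep` are hypothetical names. In a real UDF the same check would be the body of `exec(Tuple input)` in a class extending `FilterFunc`, which returns a `Boolean`.

```java
import java.util.Arrays;

// Plain-Java sketch of a filter predicate inferred from the sample rows.
final class ContainsNameSketch {

    // True when the name after '#' is one of the comma-separated names before it.
    static boolean keep(String line) {
        if (line == null) {
            return false;
        }
        int sep = line.indexOf('#');
        if (sep < 0) {
            return false;
        }
        String[] names = line.substring(0, sep).split(",");
        String target = line.substring(sep + 1);
        return Arrays.asList(names).contains(target);
    }

    public static void main(String[] args) {
        String[] lines = {"Mary,John,Steve#Steve", "Tom#Stevet"};
        for (String line : lines) {
            if (keep(line)) {
                System.out.println(line);
            }
        }
    }
}
```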
Load/Store functions:
- The base classes are LoadFunc and StoreFunc.
- They are aligned with Hadoop's InputFormat and OutputFormat.
Load function: extends LoadFunc
- getInputFormat
- prepareToRead
- setLocation
- getNext
- Schema
- Error handling: WrappedIOException (deprecated)
- Function overloading
- Reporting progress: use the protected data variable in class EvalFunc: reporter.progress();
Pig Latin + UDF = an easy way to analyze (big) data!