Apache PIG rev
Tools for Data Analysis with Hadoop
[Diagram: Hadoop, HDFS, MapReduce, Pig, Statistical Software, Hive]

Apache Pig
Tool for querying data on Hadoop clusters
Widely used in the Hadoop world
– Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
Lets you write data manipulation scripts in a high-level language called Pig Latin
– Interpreted language: scripts are translated into MapReduce jobs
Mainly targeted at joins and aggregations

Pig Elements
Pig Latin
– High-level scripting language
– Requires no metadata or schema
– Statements translated into a series of MapReduce jobs
Grunt
– Interactive shell
Piggybank
– Shared repository for User Defined Functions

Pig Latin
Language for expressing data analysis and transformation processes
Supports many traditional data operations
– join, sort, filter, etc.
Simplifies joining data and chaining jobs together

Pig Data Flow
INPUT
– LOAD from HDFS or HCatalog
TRANSFORM
– With Pig Latin expressions
OUTPUT
– DUMP to console or STORE to HDFS

Pig Latin Execution
The Pig interpreter processes each statement as soon as it is entered
If a statement is valid, it is added to a logical plan built by the interpreter
The steps in the plan do not execute in MapReduce until a DUMP or STORE command is issued

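The deferred execution described above can be sketched in Grunt as follows. This is a minimal illustration, assuming a Users.csv file with username and age columns exists in HDFS; the alias adults is hypothetical.

```pig
-- Assumes Users.csv (username,age) exists in HDFS; alias names are illustrative.
users  = LOAD 'Users.csv' USING PigStorage(',') AS (username: chararray, age: int);
adults = FILTER users BY age >= 18;  -- only added to the logical plan

DESCRIBE adults;  -- prints the schema of adults; no job is launched
EXPLAIN adults;   -- prints the execution plans without running them

DUMP adults;      -- execution is triggered here
```

DESCRIBE and EXPLAIN are handy for checking a script step by step before paying the cost of a MapReduce run.
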
Pig Latin Basic Concepts
Structures
– Field: Single piece of data
– Tuple: Ordered set of fields
  (01234, 5.0, ABC)
– Bag: Collection of tuples
  {(01234, 5.0, ABC), (44234, 12.2, DFE), (0124, 0.2, ABC)}
Relational database equivalents
– Field = Field
– Tuple = Row
– Bag ≅ Table (does not require all tuples to have the same fields)

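A short sketch of how fields, tuples and bags show up in a script. The file name readings.csv and all field and alias names are hypothetical, chosen only to mirror the sample tuples above.

```pig
-- Hypothetical file and field names, for illustration only.
readings = LOAD 'readings.csv' USING PigStorage(',')
           AS (id: chararray, value: double, code: chararray);

-- Each record of readings is a tuple of three fields. A field can be
-- referenced by name (id) or by position ($1 is the second field).
vals = FOREACH readings GENERATE id, $1 AS value;

-- GROUP yields one tuple per distinct code: the key (named group)
-- plus a bag (named readings) holding the matching tuples.
by_code = GROUP readings BY code;
DESCRIBE by_code;
```
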
Pig Example
Real example of a Pig script used at Twitter
The Java equivalent…

Pig Commands
users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);
Loading datasets from HDFS

Pig Commands
users_1825 = filter users by age >= 18 and age <= 25;
Filtering data

Pig Commands
joined = join users_1825 by username, pages by username;
Join datasets

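Because both inputs carry a username field, the joined relation contains two fields with that name. A minimal sketch of how they can be told apart, using the relation alias and the :: operator; the alias user_urls is hypothetical.

```pig
-- Both users_1825 and pages contribute a username field, so each field
-- is qualified with its source relation using '::'.
user_urls = foreach joined generate users_1825::username as username,
                                    pages::url as url;
describe user_urls;  -- prints the schema of the projected relation
```
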
Pig Commands
grouped = group joined by url;
Group records
Creates a new dataset with two fields named group and joined. There will be one record for each distinct url:
dump grouped;
( {(alice, 15), (bob, 18)})
( {(carol, 24), (alice, 14), (bob, 18)})

Pig Commands
Apply a function to records in a dataset
summed = foreach grouped generate group as url, COUNT(joined) AS views;

Pig Commands
Sort a dataset
sorted = order summed by views desc;
Keep only the first n rows
top_5 = limit sorted 5;

Pig Commands
Write a dataset to HDFS
store top_5 into 'top5_sites.csv';

Word Count in Pig
A = load '/tmp/bible+shakes.nopunc';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/wc';

Exercise: Running the HDP Tutorials
tutorial/how-to-use-basic-pig-commands/
tutorial/how-to-process-data-with-apache-pig/
– It won’t work, find out why… (read notes for solution)

Pig Local Execution Mode
Executes in a single JVM rather than on a cluster
Works exclusively with the local file system
Great for development, debugging, experimentation and prototyping

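A minimal sketch of a script intended for local mode. The script name local_test.pig and the alias sample_rows are hypothetical, and Users.csv is assumed to sit in the current directory of the local file system.

```pig
-- Hypothetical script name: local_test.pig
-- Launch in local mode from a shell:  pig -x local local_test.pig
-- (running "pig -x local" with no script opens Grunt in local mode)

-- In local mode this path refers to the local file system, not HDFS.
users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
sample_rows = limit users 5;
dump sample_rows;
```

Running a small slice of the data this way is a quick check that the script parses and the schema is right before submitting it to the cluster.
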
Example: Remove header from a CSV file
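One possible approach, sketched under assumptions: the file is the Users.csv used earlier and its first line is the literal header username,age. The aliases raw, no_header and users_clean are hypothetical, and this is not necessarily the method the original example used.

```pig
-- Assumes the header line of Users.csv is exactly: username,age
-- Load every column as chararray so the header row is read as-is.
raw = load 'Users.csv' using PigStorage(',') as (username: chararray, age: chararray);

-- Drop the row whose first field equals the header name.
no_header = filter raw by username != 'username';

-- Cast the remaining values to their intended types.
users_clean = foreach no_header generate username, (int) age as age;
```

This also drops any data row whose username is literally "username" or is null, so it only suits files where that cannot happen.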