Pig (Latin) Demo
Presented By: Imranul Hoque
Topics
Last Seminar:
- Hadoop Installation
- Running MapReduce Jobs
- MapReduce Code
- Status Monitoring
Today:
- Complexity of writing MapReduce programs
- Pig Latin and Pig
- Pig Installation
- Running Pig
Example Problem
Goal: for each sufficiently large category, find the average pagerank of high-pagerank urls in that category.

URL                 Category          Pagerank
www.google.com      Search Engine     0.9
www.cnn.com         News              0.8
www.facebook.com    Social Network    0.85
www.foxnews.com                       0.78
www.foo.com         Blah              0.1
www.bar.com                           0.5
Example Problem (cont'd)
SQL:
  SELECT category, AVG(pagerank)
  FROM url-table
  WHERE pagerank > 0.2
  GROUP BY category
  HAVING COUNT(*) > 10^6
MapReduce: ?
Procedural (MapReduce) vs. Declarative (SQL)
Pig Latin: a sweet spot between declarative and procedural
(diagram) The Pig system: Pig Latin programs run on Pig, which compiles them into MapReduce jobs on Hadoop
Pig Latin Solution
For each sufficiently large category, find the average pagerank of high-pagerank urls in that category:

urls = LOAD url-table AS (url, category, pagerank);
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
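A minimal runnable version of the same script might look like the sketch below (the quoted input path 'url-table', the tab-delimited storage function, and the output directory name are illustrative assumptions; 10^6 is written out because Pig does not accept that notation literally):

  urls = LOAD 'url-table' USING PigStorage('\t') AS (url, category, pagerank);
  good_urls = FILTER urls BY pagerank > 0.2;
  groups = GROUP good_urls BY category;
  big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
  big_avgs = FOREACH big_groups GENERATE group AS category, AVG(good_urls.pagerank) AS avg_pr;
                                         -- alias renamed from 'output', which can clash with a Pig keyword
  STORE big_avgs INTO 'big_category_avgs';   -- write the result back to the file system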
Features
- Dataflow language
- User defined functions (UDFs), e.g. find the set of urls that are classified as spam but have a high pagerank score (see the sketch below):
    spam_urls = FILTER urls BY isSpam(url);
    culprit_urls = FILTER spam_urls BY pagerank > 0.8;
- Debugging environment
- Nested data model
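A sketch of how a UDF such as isSpam would typically be hooked in (the jar name and Java package are hypothetical; REGISTER is how later Pig releases make a user jar visible, and the 0.1.x syntax may differ slightly):

  REGISTER myudfs.jar;                                -- hypothetical jar containing the Java UDF
  urls = LOAD 'url-table' USING PigStorage('\t') AS (url, category, pagerank);
  spam_urls = FILTER urls BY myudfs.IsSpam(url);      -- the UDF decides what counts as spam
  culprit_urls = FILTER spam_urls BY pagerank > 0.8;
  DUMP culprit_urls;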
Pig Latin Commands
load           Read data from the file system.
store          Write data to the file system.
foreach        Apply an expression to each record and output one or more records.
filter         Apply a predicate and remove records that do not return true.
group/cogroup  Collect records with the same key from one or more inputs.
join           Join two or more inputs based on a key.
order          Sort records based on a key.
distinct       Remove duplicate records.
union          Merge two data sets.
dump           Write output to stdout.
limit          Limit the number of records.
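As a quick illustration of several of these commands working together, here is a small sketch (the tab-delimited inputs visits.txt and pages.txt and their schemas are assumptions made for the example):

  visits = LOAD 'visits.txt' USING PigStorage('\t') AS (user, url, time);
  pages  = LOAD 'pages.txt'  USING PigStorage('\t') AS (url, pagerank);
  joined = JOIN visits BY url, pages BY url;      -- join the two inputs on url
  ranked = ORDER joined BY pagerank;              -- sort records by pagerank
  top20  = LIMIT ranked 20;                       -- keep only 20 records
  DUMP top20;                                     -- write them to stdout
  STORE joined INTO 'visits_with_pagerank';       -- write the full join result to the file system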
Pig System
(diagram) Pig Latin program → Parser → parsed program → Pig Compiler (cross-job optimizer) → execution plan (a dataflow of operators such as filter, join, and user functions f() over inputs X and Y) → MR Compiler → map-reduce jobs → Map-Reduce Cluster → output
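One way to inspect what this pipeline produces for a given alias is the EXPLAIN operator in the grunt shell (available in recent Pig releases; availability and output in Pig 0.1.1 may differ):

  grunt> urls = LOAD 'url-table' AS (url, category, pagerank);
  grunt> good_urls = FILTER urls BY pagerank > 0.2;
  grunt> EXPLAIN good_urls;    -- prints the plans Pig would execute for this alias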
MapReduce Compiler
Pig Pen
Find users who tend to visit "good" pages:
- Load Visits(user, url, time)
- Transform to (user, Canonicalize(url), time)
- Load Pages(url, pagerank)
- Join url = url
- Group by user
- Transform to (user, Average(pagerank) as avgPR)
- Filter avgPR > 0.5
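Written out as Pig Latin, that dataflow might look roughly like the sketch below (Canonicalize is assumed to be a user-defined function, and the input names 'visits' and 'pages' are illustrative):

  visits  = LOAD 'visits' AS (user, url, time);
  canon   = FOREACH visits GENERATE user, Canonicalize(url) AS url, time;
  pages   = LOAD 'pages' AS (url, pagerank);
  joined  = JOIN canon BY url, pages BY url;
  grouped = GROUP joined BY user;
  avgs    = FOREACH grouped GENERATE group AS user, AVG(joined.pagerank) AS avgPR;
  good_users = FILTER avgs BY avgPR > 0.5;
  DUMP good_users;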
Challenges?
Load Visits(user, url, time):
  (Amy, cnn.com, 8am)
  (Amy, http://www.snails.com, 9am)
  (Fred, www.snails.com/index.html, 11am)
Load Pages(url, pagerank):
  (www.cnn.com, 0.9)
  (www.snails.com, 0.4)
Transform to (user, Canonicalize(url), time):
  (Amy, www.cnn.com, 8am)
  (Amy, www.snails.com, 9am)
  (Fred, www.snails.com, 11am)
Join url = url:
  (Amy, www.cnn.com, 8am, 0.9)
  (Amy, www.snails.com, 9am, 0.4)
  (Fred, www.snails.com, 11am, 0.4)
Group by user:
  (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
  (Fred, { (Fred, www.snails.com, 11am, 0.4) })
Transform to (user, Average(pagerank) as avgPR):
  (Amy, 0.65)
  (Fred, 0.4)
Filter avgPR > 0.5:
  (Amy, 0.65)
Installation
- Extract
- Build (ant), in pig-0.1.1 and in the tutorial dir
- Set environment variables:
    PIGDIR=~/pig-0.1.1
    HADOOPSITEPATH=~/hadoop-0.18.3/conf
Running Pig
Two modes:
- Local mode
- Hadoop mode
Three ways to execute:
- Shell (grunt)
- Script
- API (currently Java)
(GUI: future work)
Running Pig (2)
Save data into HDFS:
  bin/hadoop fs -copyFromLocal excite-small.log excite-small.log
Launch shell / run script:
  java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -x mapreduce <script_name>
Our script: script1-hadoop.pig
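For a quick sanity check before the full tutorial script, a minimal script like the following can be submitted the same way (this is not the actual script1-hadoop.pig; it assumes the tutorial log's tab-delimited user, time, query layout):

  raw    = LOAD 'excite-small.log' USING PigStorage('\t') AS (user, time, query);
  grpd   = GROUP raw BY user;
  counts = FOREACH grpd GENERATE group, COUNT(raw);   -- number of log records per user
  STORE counts INTO 'excite-user-counts';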
Conclusion
For more details:
- http://hadoop.apache.org/core/
- http://wiki.apache.org/hadoop/
- http://hadoop.apache.org/pig/
- http://wiki.apache.org/pig/