Presented By: Imranul Hoque

Presentation transcript:

Pig (Latin) Demo

Topics
Last Seminar:
  Hadoop Installation
  Running MapReduce Jobs
  MapReduce Code
  Status Monitoring
Today:
  Complexity of writing MapReduce programs
  Pig Latin and Pig
  Pig Installation
  Running Pig

Example Problem
Goal: for each sufficiently large category, find the average pagerank of high-pagerank urls in that category.

URL                Category        Pagerank
www.google.com     Search Engine   0.9
www.cnn.com        News            0.8
www.facebook.com   Social Network  0.85
www.foxnews.com                    0.78
www.foo.com        Blah            0.1
www.bar.com                        0.5

Example Problem (cont’d)
SQL:
  SELECT category, AVG(pagerank)
  FROM url_table
  WHERE pagerank > 0.2
  GROUP BY category
  HAVING COUNT(*) > 1000000
MapReduce: ?
Procedural (MapReduce) vs. Declarative (SQL)
Pig Latin: Sweet spot between declarative and procedural
[Diagram labels: Pig Latin, Pig System, MapReduce, Hadoop]
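To make the "MapReduce: ?" question concrete, here is a minimal sketch in plain Python of how the same query might be split into a map step and a reduce step. It only imitates the shape of a MapReduce job and does not use the Hadoop API. The function names, the `min_count` parameter, and the sample rows are assumptions made for illustration, not part of the original slides.

```python
from collections import defaultdict


def map_phase(record):
    """Map: keep high-pagerank urls and emit (category, pagerank)."""
    url, category, pagerank = record
    if pagerank > 0.2:
        yield category, pagerank


def shuffle(pairs):
    """Shuffle: collect all values that share a key (the framework does this in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(category, pageranks, min_count):
    """Reduce: emit the average only for sufficiently large categories."""
    if len(pageranks) > min_count:
        yield category, sum(pageranks) / len(pageranks)


def run_job(records, min_count):
    """Chain map, shuffle, and reduce over an in-memory list of records."""
    pairs = [kv for record in records for kv in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(
        result
        for category, pageranks in grouped.items()
        for result in reduce_phase(category, pageranks, min_count)
    )


# Made-up rows (url, category, pagerank); these are not the slide's table.
sample = [
    ("a.example", "News", 0.8),
    ("b.example", "News", 0.6),
    ("c.example", "News", 0.1),
    ("d.example", "Blah", 0.5),
]

# min_count=1 stands in for the slide's 10^6 threshold so the toy data passes.
averages = run_job(sample, min_count=1)
print(averages)
```

Even in this toy form, the filter, the grouping key, and the aggregate are spread across separate functions, which is the procedural burden the slide contrasts with the one-statement SQL query.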

Pig Latin Solution
  urls = LOAD 'url-table' AS (url, category, pagerank);
  good_urls = FILTER urls BY pagerank > 0.2;
  groups = GROUP good_urls BY category;
  big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
  -- After GROUP the key field is named "group"; "output" is a reserved word in Pig, so the alias is "result".
  result = FOREACH big_groups GENERATE group AS category, AVG(good_urls.pagerank);
For each sufficiently large category, find the average pagerank of high-pagerank urls in that category.

Features
  Dataflow language
    Find the set of urls that are classified as spam but have a high pagerank score:
      spam_urls = FILTER urls BY isSpam(url);
      culprit_urls = FILTER spam_urls BY pagerank > 0.8;
  User defined function (UDF)
  Debugging environment
  Nested data model
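The spam example chains two filters, the first of which calls a user defined function. The sketch below mirrors that idea in plain Python. The body of `is_spam` is a hypothetical stand-in (the slides do not say how isSpam works), and `filter_by` and the sample records are likewise assumptions for illustration.

```python
def is_spam(url):
    """Hypothetical stand-in for the slide's isSpam UDF."""
    return "cheap-pills" in url


def filter_by(records, predicate):
    """Rough analogue of FILTER ... BY: keep records for which the predicate is true."""
    return [record for record in records if predicate(record)]


# Made-up records; field names follow the slide's (url, pagerank) columns.
urls = [
    {"url": "cheap-pills.example", "pagerank": 0.9},
    {"url": "cheap-pills.example/buy", "pagerank": 0.3},
    {"url": "news.example", "pagerank": 0.95},
]

# Each step names its intermediate result, as in the dataflow on the slide.
spam_urls = filter_by(urls, lambda r: is_spam(r["url"]))
culprit_urls = filter_by(spam_urls, lambda r: r["pagerank"] > 0.8)
print(culprit_urls)
```

Because the predicate is an ordinary function, arbitrary user logic can be dropped into the dataflow without changing the surrounding steps.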

Pig Latin Commands
  load            Read data from file system.
  store           Write data to file system.
  foreach         Apply expression to each record and output one or more records.
  filter          Apply predicate and remove records that do not return true.
  group/cogroup   Collect records with the same key from one or more inputs.
  join            Join two or more inputs based on a key.
  order           Sort records based on a key.
  distinct        Remove duplicate records.
  union           Merge two data sets.
  dump            Write output to stdout.
  limit           Limit the number of records.
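The difference between cogroup and join in the list above is easy to miss, so here is a small Python sketch of the semantics as described: cogroup collects matching records from each input into separate bags, and an inner join pairs them up. This is an illustration under simple in-memory assumptions, not how Pig implements either command. The helper names are hypothetical, and the sample tuples are the canonicalized visits and pages used later in the deck.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """Collect records with the same key from two inputs into per-input lists."""
    out = defaultdict(lambda: ([], []))
    for rec in left:
        out[left_key(rec)][0].append(rec)
    for rec in right:
        out[right_key(rec)][1].append(rec)
    return dict(out)


def join(left, right, left_key, right_key):
    """Inner join: pair every left record with every right record sharing its key."""
    return [
        l + r
        for l_bag, r_bag in cogroup(left, right, left_key, right_key).values()
        for l in l_bag
        for r in r_bag
    ]


visits = [
    ("Amy", "www.cnn.com", "8am"),
    ("Amy", "www.snails.com", "9am"),
    ("Fred", "www.snails.com", "11am"),
]
pages = [("www.cnn.com", 0.9), ("www.snails.com", 0.4)]

# Visits are keyed on their url (index 1), pages on their url (index 0).
grouped = cogroup(visits, pages, lambda v: v[1], lambda p: p[0])
joined = join(visits, pages, lambda v: v[1], lambda p: p[0])
print(grouped["www.snails.com"])
print(joined)
```

Keeping the per-input bags separate is what makes cogroup a nested result, whereas join flattens each matching pair into a single record.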

Pig System
[Diagram: user -> Pig Latin program -> Parser -> parsed program -> Pig Compiler (cross-job optimizer) -> execution plan -> MR Compiler -> map-red. jobs -> Map-Reduce Cluster -> output. Example plan labels: X, Y, filter, f( ), join, output.]

MapReduce Compiler

Pig Pen
Find users who tend to visit “good” pages
  Load Visits(user, url, time)
  Load Pages(url, pagerank)
  Transform to (user, Canonicalize(url), time)
  Join url = url
  Group by user
  Transform to (user, Average(pagerank) as avgPR)
  Filter avgPR > 0.5

Challenges?
Load Visits(user, url, time)
  (Amy, cnn.com, 8am)
  (Amy, http://www.snails.com, 9am)
  (Fred, www.snails.com/index.html, 11am)
Load Pages(url, pagerank)
  (www.cnn.com, 0.9)
  (www.snails.com, 0.4)
Transform to (user, Canonicalize(url), time)
  (Amy, www.cnn.com, 8am)
  (Amy, www.snails.com, 9am)
  (Fred, www.snails.com, 11am)
Join url = url
  (Amy, www.cnn.com, 8am, 0.9)
  (Amy, www.snails.com, 9am, 0.4)
  (Fred, www.snails.com, 11am, 0.4)
Group by user
  (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) })
  (Fred, { (Fred, www.snails.com, 11am, 0.4) })
Transform to (user, Average(pagerank) as avgPR)
  (Amy, 0.65)
  (Fred, 0.4)
Filter avgPR > 0.5
  (Amy, 0.65)
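The walkthrough above can be replayed step by step in plain Python, using the slide's own sample tuples. This is a sketch of the dataflow only, not Pig code. The `canonicalize` body is a hypothetical guess at what Canonicalize(url) does (the slides show only its inputs and outputs), and the dictionary-based join assumes each url appears once in Pages.

```python
from collections import defaultdict


def canonicalize(url):
    """Hypothetical Canonicalize(): drop scheme and path, ensure a www. prefix."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host


# Load: the two inputs exactly as shown on the slide.
visits = [
    ("Amy", "cnn.com", "8am"),
    ("Amy", "http://www.snails.com", "9am"),
    ("Fred", "www.snails.com/index.html", "11am"),
]
pages = [("www.cnn.com", 0.9), ("www.snails.com", 0.4)]

# Transform to (user, Canonicalize(url), time).
canonical = [(user, canonicalize(url), time) for user, url, time in visits]

# Join url = url (assumes one pagerank per url, so a dict lookup suffices).
pagerank_of = dict(pages)
joined = [
    (user, url, time, pagerank_of[url])
    for user, url, time in canonical
    if url in pagerank_of
]

# Group by user: each user maps to a bag of joined rows.
by_user = defaultdict(list)
for row in joined:
    by_user[row[0]].append(row)

# Transform to (user, Average(pagerank) as avgPR).
avg_pr = {
    user: sum(row[3] for row in rows) / len(rows)
    for user, rows in by_user.items()
}

# Filter avgPR > 0.5.
good_users = {user: avg for user, avg in avg_pr.items() if avg > 0.5}
print(good_users)
```

Each intermediate variable corresponds to one box in the slide's dataflow, so any step can be inspected on its own, which is the kind of per-step visibility the Pig Pen debugging environment is meant to provide.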

Installation
  Extract
  Build (ant)
    In pig-0.1.1 and in tutorial dir
  Environment variable
    PIGDIR=~/pig-0.1.1
    HADOOPSITEPATH=~/hadoop-0.18.3/conf

Running Pig
Two modes:
  Local mode
  Hadoop mode
Three ways to execute:
  Shell (grunt)
  Script
  API (currently Java)
  GUI (future work)

Running Pig (2)
Save data into HDFS
  bin/hadoop fs -copyFromLocal excite-small.log excite-small.log
Launch shell/Run script
  java -cp $PIGDIR/pig.jar:$HADOOPSITEPATH org.apache.pig.Main -x mapreduce <script_name>
Our script: script1-hadoop.pig

Conclusion
For more details:
  http://hadoop.apache.org/core/
  http://wiki.apache.org/hadoop/
  http://hadoop.apache.org/pig/
  http://wiki.apache.org/pig/