Apache Pig, rev 2, 2014-05-27. Tools for Data Analysis with Hadoop: Hadoop, HDFS, MapReduce, Pig, Statistical Software, Hive.

Apache Pig
Tool for querying data on Hadoop clusters
Widely used in the Hadoop world
– Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
Lets you write data manipulation scripts in a high-level language called Pig Latin
– Interpreted language: scripts are translated into MapReduce jobs
Mainly targeted at joins and aggregations

Pig Elements
Pig Latin
– High-level scripting language
– Requires no metadata or schema
– Statements translated into a series of MapReduce jobs
Grunt
– Interactive shell
Piggybank
– Shared repository for User Defined Functions

Pig Latin
Language for expressing data analysis and transformation processes
Supports many traditional data operations
– join, sort, filter, etc.
Simplifies joining data and chaining jobs together

Pig Data Flow
INPUT
– LOAD from HDFS or HCatalog
TRANSFORM
– With Pig Latin expressions
OUTPUT
– DUMP to console or STORE to HDFS

Pig Latin Execution
The Pig interpreter processes each statement as soon as it is entered
If a statement is valid, it is added to a logical plan built by the interpreter
The steps in the plan do not execute as MapReduce jobs until a DUMP or STORE command is issued
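The deferred execution described on this slide can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Pig is implemented: the class LazyRelation and its methods are hypothetical names invented for the illustration, and the data is made up.

```python
class LazyRelation:
    """Toy stand-in for a Pig relation: records steps, runs them only on dump()."""

    def __init__(self, rows, plan=()):
        self._rows = list(rows)
        self._plan = tuple(plan)  # the "logical plan": steps recorded so far

    def filter(self, predicate):
        # Nothing is evaluated here; the step is only appended to the plan.
        return LazyRelation(self._rows, self._plan + (("filter", predicate),))

    def foreach(self, transform):
        return LazyRelation(self._rows, self._plan + (("foreach", transform),))

    def dump(self):
        # Only now is the recorded plan actually executed, step by step.
        rows = self._rows
        for kind, fn in self._plan:
            if kind == "filter":
                rows = [row for row in rows if fn(row)]
            else:
                rows = [fn(row) for row in rows]
        return rows


calls = []  # records every row the predicate is evaluated on


def is_adult(row):
    calls.append(row)
    return row[1] >= 18


users = LazyRelation([("alice", 15), ("bob", 18), ("carol", 24)])
adults = users.filter(is_adult)
names = adults.foreach(lambda row: row[0])

calls_before_dump = len(calls)  # the predicate has not run yet
result = names.dump()           # triggers execution of the whole plan
```

In this model, building adults and names costs nothing; the predicate only runs once dump() is called, which mirrors how Pig waits for DUMP or STORE.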

Pig Latin Basic Concepts
Structures
– Field: single piece of data
– Tuple: ordered set of fields
(01234, 5.0, ABC)
– Bag: collection of tuples
{(01234, 5.0, ABC), (44234, 12.2, DFE), (0124, 0.2, ABC)}
Relational database equivalents
– Field = Field
– Tuple = Row
– Bag ≅ Table (a bag does not require all tuples to have the same fields)
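As a rough analogy only, the three structures can be mirrored with built-in Python types: a tuple for a Pig tuple and a list of tuples for a bag. The values are the ones from the slide, written as strings where a leading zero would not be a valid Python number; the extra one-field tuple is a made-up addition to show that a bag may hold tuples of different widths.

```python
# A tuple: an ordered set of fields.
row = ("01234", 5.0, "ABC")

# A field: a single piece of data, here addressed by position.
field = row[1]

# A bag: a collection of tuples.
bag = [
    ("01234", 5.0, "ABC"),
    ("44234", 12.2, "DFE"),
    ("0124", 0.2, "ABC"),
]

# Unlike a table, a bag does not force every tuple to have the same fields.
# The one-field tuple below is hypothetical, added only to show this.
ragged_bag = bag + [("99999",)]
widths = sorted({len(t) for t in ragged_bag})
```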

Pig Example
Real example of a Pig script used at Twitter
The Java equivalent…

Pig Commands
Loading datasets from HDFS
users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);

Pig Commands
Filtering data
users_1825 = filter users by age>=18 and age<=25;

Pig Commands
Join datasets
joined = join users_1825 by username, pages by username;
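To make the semantics of the filter and join steps concrete, here is a small in-memory sketch in plain Python. It assumes an inner join that keeps all fields from both sides, which is how Pig's default join behaves; the helper inner_join and the sample rows are hypothetical, made up for the illustration.

```python
# Hypothetical sample data in the shape of Users.csv and Pages.csv.
users = [("alice", 15), ("bob", 18), ("carol", 24), ("dave", 40)]
pages = [
    ("bob", "www.twitter.com"),
    ("carol", "www.facebook.com"),
    ("dave", "www.twitter.com"),
    ("bob", "www.facebook.com"),
]

# users_1825 = filter users by age>=18 and age<=25;
users_1825 = [u for u in users if 18 <= u[1] <= 25]


def inner_join(left, right, left_key, right_key):
    """Pair every left row with every right row that has the same key."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[right_key], []).append(right_row)
    return [
        left_row + right_row
        for left_row in left
        for right_row in index.get(left_row[left_key], [])
    ]


# joined = join users_1825 by username, pages by username;
joined = inner_join(users_1825, pages, 0, 0)
```

Each joined row carries the fields of both inputs, so the username appears twice: once from users_1825 and once from pages.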

Pig Commands
Group records
grouped = group joined by url;
Creates a new dataset with two fields named group and joined. There will be one record for each distinct url:
dump grouped;
(www.twitter.com, {(alice, 15), (bob, 18)})
(www.facebook.com, {(carol, 24), (alice, 14), (bob, 18)})
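The shape of the grouped dataset can be sketched in plain Python: one record per distinct key, each holding the key and a bag of the original rows. The helper group_by and the sample rows are hypothetical and only illustrate the structure; they are not Pig's implementation.

```python
from collections import defaultdict

# Hypothetical joined rows: (username, age, username, url).
joined = [
    ("bob", 18, "bob", "www.twitter.com"),
    ("bob", 18, "bob", "www.facebook.com"),
    ("carol", 24, "carol", "www.facebook.com"),
]


def group_by(rows, key_index):
    """Return one (group, bag) record per distinct key value."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_index]].append(row)
    return [(key, bag) for key, bag in bags.items()]


# grouped = group joined by url;
grouped = group_by(joined, 3)
by_url = dict(grouped)
```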

Pig Commands
Apply function to records in a dataset
summed = foreach grouped generate group as url, COUNT(joined) AS views;

Pig Commands
Sort a dataset
sorted = order summed by views desc;
Keep the first n rows
top_5 = limit sorted 5;
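The counting, sorting and limiting steps can be mirrored in a few lines of plain Python. This is a sketch on made-up grouped data, with a limit of 2 instead of 5 to keep the example short.

```python
# Hypothetical grouped records: (url, bag of rows that visited that url).
grouped = [
    ("www.twitter.com", [("bob",), ("dave",)]),
    ("www.facebook.com", [("carol",), ("alice",), ("bob",)]),
    ("www.example.com", [("erin",)]),
]

# summed = foreach grouped generate group as url, COUNT(joined) AS views;
summed = [(url, len(bag)) for url, bag in grouped]

# sorted = order summed by views desc;
sorted_sites = sorted(summed, key=lambda record: record[1], reverse=True)

# top_5 = limit sorted 5;  (limit of 2 here)
top_2 = sorted_sites[:2]
```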

Pig Commands
Write a dataset to HDFS
store top_5 into 'top5_sites.csv';

Word Count in Pig
A = load '/tmp/bible+shakes.nopunc';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/wc';
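For readers who want to check what each relation in the script holds, the same steps can be traced in plain Python on a couple of lines of text. This is an approximation: it assumes whitespace splitting in place of TOKENIZE (which also splits on a few punctuation characters), and it uses a full-string match for the filter, as Pig's matches operator does. The function word_count and the sample lines are made up for the illustration.

```python
import re
from collections import Counter


def word_count(lines):
    # B: one word per record (whitespace split, an approximation of TOKENIZE).
    words = [word for line in lines for word in line.split()]
    # C: keep only tokens made entirely of word characters.
    words = [word for word in words if re.fullmatch(r"\w+", word)]
    # D and E: group by word and count the members of each group.
    counts = Counter(words)
    # F: order by count, highest first; records are (count, word) as in E.
    pairs = [(n, word) for word, n in counts.items()]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


result = word_count(["to be or not to be", "to err is human"])
filtered = word_count(["hello, world"])  # "hello," is dropped by the filter
```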

Exercise: Running the HDP Tutorials
http://hortonworks.com/hadoop-tutorial/how-to-use-basic-pig-commands/
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/
– It won’t work, find out why… (read notes for solution)

Pig Local Execution Mode
Executes in a single JVM rather than on a cluster
Works exclusively with the local file system
Great for development, debugging, experimentation and prototyping

Example: Remove header from a CSV file
