Published by Charlotte Lawrence. Modified over 9 years ago.
1
Apache PIG rev 2 2014-05-27
2
Tools for Data Analysis with Hadoop
Hadoop: HDFS, MapReduce
Pig, Hive, Statistical Software
3
Apache Pig
Tool for querying data on Hadoop clusters
Widely used in the Hadoop world
– Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts
Lets you write data manipulation scripts in a high-level language called Pig Latin
– Interpreted language: scripts are translated into MapReduce jobs
Mainly targeted at joins and aggregations
4
Pig Elements
Pig Latin
– High-level scripting language
– Requires no metadata or schema
– Statements are translated into a series of MapReduce jobs
Grunt
– Interactive shell
Piggybank
– Shared repository for User Defined Functions
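As a minimal sketch of what "requires no metadata or schema" can look like in practice, the script below loads a file without an AS clause and refers to fields by position. The file name 'Users.csv' and its column layout (username, age) are assumptions borrowed for illustration.

```pig
-- Load without declaring a schema; fields are addressed by position ($0, $1, ...).
-- 'Users.csv' and its column order (username, age) are assumed for illustration.
raw = LOAD 'Users.csv' USING PigStorage(',');

-- Without a schema, fields default to bytearray, so cast before a numeric comparison.
adults = FILTER raw BY (int)$1 >= 18;

-- Names and types can still be attached later in the pipeline.
named = FOREACH adults GENERATE (chararray)$0 AS username, (int)$1 AS age;
DUMP named;
```

Declaring the schema at load time, where it is known, usually makes later statements easier to read than positional references.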
5
Pig Latin
Language for expressing data analysis and transformation processes
Supports many traditional data operations
– join, sort, filter, etc.
Simplifies joining data and chaining jobs together
6
Pig Data Flow
INPUT
– LOAD from HDFS or HCatalog
TRANSFORM
– With Pig Latin expressions
OUTPUT
– DUMP to console or STORE to HDFS
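The three stages can be sketched as one short script. This is only an illustration: the input file 'Pages.csv', its columns, and the output directory name are assumptions, and loading from HCatalog is not shown.

```pig
-- INPUT: read comma-separated text from HDFS (path is hypothetical).
pages = LOAD 'Pages.csv' USING PigStorage(',') AS (username:chararray, url:chararray);

-- TRANSFORM: keep one column and remove duplicates.
urls = FOREACH pages GENERATE url;
unique_urls = DISTINCT urls;

-- OUTPUT: write the result to an HDFS directory (name is hypothetical).
-- Replace STORE with "DUMP unique_urls;" to print to the console instead.
STORE unique_urls INTO 'unique_urls_out';
```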
7
Pig Latin Execution
The Pig interpreter parses each statement as soon as it is entered
If a statement is valid, it is added to a logical plan built by the interpreter
The steps in the plan do not execute as MapReduce jobs until a DUMP or STORE command is issued
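A small sketch of this deferred execution, as it might be typed into Grunt. The file name and schema are assumptions for illustration; DESCRIBE and EXPLAIN are standard Pig diagnostic operators that inspect the plan without running it.

```pig
-- These statements are only parsed and added to the logical plan.
users = LOAD 'Users.csv' USING PigStorage(',') AS (username:chararray, age:int);
adults = FILTER users BY age >= 18;

-- Prints the schema of the relation; no job is launched.
DESCRIBE adults;

-- Prints the logical, physical and MapReduce plans; still no job is launched.
EXPLAIN adults;

-- Only now is the plan compiled and executed.
DUMP adults;
```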
8
Pig Latin Basic Concepts
Structures
– Field: Single piece of data
– Tuple: Ordered set of fields, e.g. (01234, 5.0, ABC)
– Bag: Collection of tuples, e.g. {(01234, 5.0, ABC), (44234, 12.2, DFE), (0124, 0.2, ABC)}
Relational database equivalents
– Field = Field
– Tuple = Row
– Bag ≅ Table (a bag does not require all tuples to have the same fields)
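To see these structures in a script, the sketch below builds a tuple explicitly with the built-in TOTUPLE function and lets GROUP produce bags. The input file 'data.csv' and its three columns are hypothetical, chosen to resemble the sample values above.

```pig
-- Hypothetical input with three fields per line, e.g. 01234,5.0,ABC
rows = LOAD 'data.csv' USING PigStorage(',') AS (id:chararray, score:double, label:chararray);

-- Wrap the three fields of each row in a single tuple-valued field.
as_tuple = FOREACH rows GENERATE TOTUPLE(id, score, label) AS rec;

-- GROUP yields one record per label: the key plus a bag of the matching tuples.
by_label = GROUP rows BY label;

-- DESCRIBE shows the nested tuple and bag types without running a job.
DESCRIBE as_tuple;
DESCRIBE by_label;
```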
9
Pig Example
Real example of a Pig script used at Twitter
The Java equivalent…
10
Pig Commands
Loading datasets from HDFS
users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);
11
Pig Commands
Filtering data
users_1825 = filter users by age >= 18 and age <= 25;
12
Pig Commands
Joining datasets
joined = join users_1825 by username, pages by username;
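A joined relation keeps the fields of both inputs, so both inputs contribute a username field. The sketch below, which reuses the assumed Users.csv and Pages.csv layouts, shows how the :: prefix can be used to pick one of them; the relation name visits is introduced here only for illustration.

```pig
users = LOAD 'Users.csv' USING PigStorage(',') AS (username:chararray, age:int);
pages = LOAD 'Pages.csv' USING PigStorage(',') AS (username:chararray, url:chararray);

joined = JOIN users BY username, pages BY username;

-- Field names that occur in both inputs are disambiguated as relation::field.
visits = FOREACH joined GENERATE
    users::username AS username,
    users::age      AS age,
    pages::url      AS url;

DESCRIBE visits;
```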
13
Pig Commands
Grouping records
grouped = group joined by url;
Creates a new dataset with two fields, named group and joined. There is one record for each distinct url:
dump grouped;
(www.twitter.com, {(alice, 15), (bob, 18)})
(www.facebook.com, {(carol, 24), (alice, 14), (bob, 18)})
14
Pig Commands
Applying a function to the records in a dataset
summed = foreach grouped generate group as url, COUNT(joined) as views;
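FOREACH can also carry a nested block when more than one aggregate is needed per group. The following is a sketch under assumptions: it groups the Pages.csv data directly, and the names by_url, stats and unique_users are hypothetical.

```pig
pages = LOAD 'Pages.csv' USING PigStorage(',') AS (username:chararray, url:chararray);
by_url = GROUP pages BY url;

stats = FOREACH by_url {
    -- Project the usernames out of the bag, then de-duplicate them.
    names = pages.username;
    visitors = DISTINCT names;
    GENERATE group AS url,
             COUNT(pages) AS views,
             COUNT(visitors) AS unique_users;
};

DUMP stats;
```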
15
Pig Commands
Sorting a dataset
sorted = order summed by views desc;
Keeping the first n rows
top_5 = limit sorted 5;
16
Pig Commands
Writing a dataset to HDFS
store top_5 into 'top5_sites.csv';
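Two details are worth keeping in mind when storing: the target of STORE is a directory of part files rather than a single file, and the default PigStorage delimiter is a tab. A sketch that writes comma-separated output follows; the output directory name 'site_counts' is an assumption, and the directory must not already exist.

```pig
pages = LOAD 'Pages.csv' USING PigStorage(',') AS (username:chararray, url:chararray);
by_url = GROUP pages BY url;
counts = FOREACH by_url GENERATE group AS url, COUNT(pages) AS views;

-- Pass the delimiter explicitly to get comma-separated part files.
STORE counts INTO 'site_counts' USING PigStorage(',');
```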
17
Word Count in Pig
A = load '/tmp/bible+shakes.nopunc';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/wc';
18
Exercise: Running the HDP Tutorials
http://hortonworks.com/hadoop-tutorial/how-to-use-basic-pig-commands/
http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/
– It won’t work, find out why… (read notes for solution)
19
Pig Local Execution Mode
Executes in a single JVM rather than on a cluster
Works exclusively with the local file system
Great for development, debugging, experimentation and prototyping
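A small script for trying local mode might look like the sketch below. The script name and the input path are hypothetical; the -x local option of the pig command selects local execution.

```pig
-- wordcount_local.pig (hypothetical file name)
-- Run with:  pig -x local wordcount_local.pig
-- In local mode the paths below refer to the local file system, not HDFS.
lines = LOAD '/tmp/input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```

The same script can later be run against a cluster without changes to its statements, provided the paths exist in HDFS.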
20
Example: Remove header from a CSV file
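One possible approach, not necessarily the one the presentation used, is to filter out the row whose first field matches the header label. The sketch assumes a file 'Users.csv' whose first line is the header username,age; the output directory name is hypothetical.

```pig
-- Load every column as text so the header line parses like any other row.
raw = LOAD 'Users.csv' USING PigStorage(',') AS (username:chararray, age:chararray);

-- Drop the row whose first field equals the header label.
-- Caveat: rows with a null username are dropped as well, since a comparison with null is not true.
no_header = FILTER raw BY username != 'username';

-- Restore the intended types once the header is gone.
users = FOREACH no_header GENERATE username, (int)age AS age;

STORE users INTO 'users_no_header' USING PigStorage(',');
```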