APACHE PIG
Presented by Priagung Khusumanegara
Prof. Kyungbaek Kim
Agenda
- Introducing Pig
- Pig Characteristics
- Pig Elements
- Pig Latin Foundation
- Data Flow
- Pig Feature
- Data Types
- Pig Operator and Function
Pig Characteristics
- A platform for analyzing large data sets that runs on top of Hadoop
- Provides a high-level language for expressing data analysis
- Uses both HDFS (to read and write files) and MapReduce (to execute jobs)
Pig Elements
- Pig Latin: a high-level scripting language designed specifically for expressing data transformations and data flows
- Grunt: the interactive environment in which Pig Latin commands are executed; Local and Hadoop modes are currently supported
- Pig interpreter: converts Pig Latin into MapReduce jobs
Pig Latin Data Flow
- A LOAD statement to read data from the file system
- A series of "transformation" statements to process the data
- A DUMP statement to view the results, or a STORE statement to save them
LOAD -> TRANSFORM -> DUMP or STORE
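The three stages above can be sketched as a short script. This is a minimal illustration, not taken from the original slides: the file name students.txt, its comma-separated layout, the field names, and the output directory passed_out are all assumptions.

```pig
-- LOAD: read a hypothetical comma-separated file with lines such as "alice,19,pass"
data = LOAD 'students.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- TRANSFORM: keep only the rows whose status is 'pass'
passed = FILTER data BY status == 'pass';

-- DUMP or STORE: print the result, or write it to a (hypothetical) output directory
DUMP passed;
STORE passed INTO 'passed_out' USING PigStorage(',');
```

In practice a script usually ends with either DUMP (while experimenting) or STORE (for real output) rather than both.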
Running Pig
- Script: executes the commands in a file
  $ pig scriptFile.pig
- Grunt: interactive shell for executing Pig commands; started when a script file is NOT provided
Running Modes
- Local mode
  - Executes in a single JVM
  - Works exclusively with the local file system
  - Great for development, experimentation and prototyping
- Hadoop mode
  - Also known as MapReduce mode
  - Pig renders Pig Latin into MapReduce jobs and executes them on the cluster
  - Can execute against a pseudo-distributed or a fully distributed Hadoop installation
Running Modes
- Local mode: $ pig -x local
- Hadoop mode: $ pig -x mapreduce
Hadoop Mode
Pig Relation
- Pig Latin statements work with relations
- A field is a piece of data, e.g. 19
- A tuple is an ordered set of fields, e.g. (19,2)
- A bag is an unordered collection of tuples, e.g. {(19,2), (18,1)}
- A relation is a bag
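To make the terms concrete, here is a small sketch that loads the two tuples used above into a relation. The file pairs.txt (lines "19,2" and "18,1") and the field names a and b are hypothetical.

```pig
-- pairs is a relation, i.e. a bag of tuples; each tuple has two int fields
pairs = LOAD 'pairs.txt' USING PigStorage(',') AS (a:int, b:int);

-- Prints one tuple per line, such as (19,2); a bag is unordered,
-- so the order of the printed tuples is not guaranteed
DUMP pairs;

-- Prints the schema of the relation (its field names and types)
DESCRIBE pairs;
```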
Data Type
Type | Description | Example
int | Signed 32-bit integer | 10
long | Signed 64-bit integer | Data: 10L or 10l; Display: 10L
float | 32-bit floating point | Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double | 64-bit floating point | Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray | Character array (string) in Unicode UTF-8 format | hello world
boolean | Boolean | true/false (case insensitive)
datetime | Date and time | 1970-01-01T00:00:00.000+00:00
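These types typically appear in the schema of a LOAD statement or in an explicit cast. The sketch below assumes a hypothetical tab-separated file measurements.txt; the field names are invented for illustration.

```pig
-- Declare a type for every field in the schema (AS clause)
m = LOAD 'measurements.txt' USING PigStorage('\t')
    AS (id:long, count:int, ratio:float, total:double, label:chararray);

-- An explicit cast converts a field to another type, here int to double
scaled = FOREACH m GENERATE id, (double)count * 10.5 AS weighted, label;
```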
LOAD operator
- Loads the contents of text files into a bag named data
- An optional schema names the fields and gives their types
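A LOAD statement with and without a schema might look as follows. The file name students.txt, the comma delimiter, and the field names are assumptions made for this sketch.

```pig
-- With a schema: fields can be referenced by name (name, score, status)
data = LOAD 'students.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- Without a schema: the default loader splits lines on tabs,
-- and fields are referenced by position ($0, $1, $2, ...)
raw = LOAD 'students.txt';
first_col = FOREACH raw GENERATE $0;
```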
DUMP and STORE operator
- No action is taken until a DUMP or STORE command is encountered
- Until then, Pig parses, validates and analyzes the statements but does not execute them
- DUMP: displays the results on the screen
- STORE: saves the results to a file
DUMP and STORE operator: DUMP example and STORE example
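A possible pair of examples is sketched here. The input file students.txt, its fields, and the output directory students_out are hypothetical.

```pig
data = LOAD 'students.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- DUMP example: triggers execution and prints every tuple to the screen
DUMP data;

-- STORE example: triggers execution and writes the tuples, comma-separated,
-- into the directory students_out (on Hadoop the directory must not already exist)
STORE data INTO 'students_out' USING PigStorage(',');
```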
FILTER and GROUP operator
- FILTER: filter the bag data
- GROUP: group the bag filtered by score
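A minimal sketch of the two operators, using the bag names data and filtered from the slide. The input file, the schema, and the threshold of 60 are assumptions.

```pig
data = LOAD 'students.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- FILTER keeps only the tuples that satisfy the condition
filtered = FILTER data BY score >= 60;

-- GROUP collects tuples with the same score into one tuple of the form
-- (group, {bag of matching tuples})
grouped = GROUP filtered BY score;
DUMP grouped;
```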
ORDER operator
Note: for descending order, add DESC:
Sorted = ORDER data BY score DESC;
FOREACH operator
- For each row, emit the score and status fields
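The projection described above could be written as follows; the input file and schema are hypothetical.

```pig
data = LOAD 'students.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- For every tuple in data, generate a new tuple holding only score and status
result = FOREACH data GENERATE score, status;
DUMP result;
```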
DISTINCT operator
- Removes duplicate tuples from a bag
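A short sketch, assuming a hypothetical file scores.txt that may contain repeated lines:

```pig
data = LOAD 'scores.txt' USING PigStorage(',') AS (score:int, status:chararray);

-- Tuples are duplicates only if all of their fields are equal
unique = DISTINCT data;
DUMP unique;
```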
UNION operator
- Merges the contents of two or more bags
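As an illustration, two bags with the same schema can be merged like this. The file names data1.txt and data2.txt and the field names are assumptions.

```pig
data1 = LOAD 'data1.txt' USING PigStorage(',') AS (name:chararray, score:int);
data2 = LOAD 'data2.txt' USING PigStorage(',') AS (name:chararray, score:int);

-- UNION concatenates the bags; it does not remove duplicates
-- and does not preserve tuple order
merged = UNION data1, data2;
DUMP merged;
```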
JOIN operator
- The bags data1 and data2 are joined on their first fields
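A join on the first field of each bag can be sketched with positional references. The input files and their schemas are hypothetical.

```pig
data1 = LOAD 'data1.txt' USING PigStorage(',') AS (id:int, name:chararray);
data2 = LOAD 'data2.txt' USING PigStorage(',') AS (id:int, score:int);

-- $0 is the first field of each bag; this is an inner join, so only
-- tuples whose first fields match appear in the result
joined = JOIN data1 BY $0, data2 BY $0;
DUMP joined;
```

Each output tuple contains all fields of the matching data1 tuple followed by all fields of the matching data2 tuple.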
SUM, MIN, AVG Function
Note:
- MIN: finds the minimum value
- SUM: finds the sum of the values
- AVG: finds the average value
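These functions operate on a bag, so the data is usually grouped first. The sketch below computes all three over the score field of a hypothetical students.txt; the alias names are assumptions.

```pig
data = LOAD 'students.txt' USING PigStorage(',')
       AS (name:chararray, score:int, status:chararray);

-- GROUP ... ALL puts every tuple into a single group, giving one bag named data
grouped = GROUP data ALL;

-- data.score is the bag of all score values within the group
stats = FOREACH grouped GENERATE
            MIN(data.score) AS min_score,
            SUM(data.score) AS sum_score,
            AVG(data.score) AS avg_score;
DUMP stats;
```

Grouping by a field instead of ALL (for example, GROUP data BY status) would yield one row of statistics per group.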