Presented by Priagung Khusumanegara Prof. Kyungbaek Kim


1 APACHE PIG
Presented by Priagung Khusumanegara Prof. Kyungbaek Kim

2 Agenda
- Introducing Pig
- Pig Characteristics
- Pig Elements
- Pig Latin Foundation
- Data Flow
- Pig Features
- Data Types
- Pig Operators and Functions

3 Pig Characteristics
- A platform for analyzing large data sets that runs on top of Hadoop
- Provides a high-level language for expressing data analysis
- Uses both HDFS (to read and write files) and MapReduce (to execute jobs)

4 Pig Elements
Pig Latin
- High-level scripting language
- Designed specifically for data transformation and flow expression
Grunt
- The environment in which Pig Latin commands are executed
- Currently supports Local and Hadoop modes
Pig Interpreter
- Converts Pig Latin to MapReduce

5 Pig Latin Data Flow
- A LOAD statement to read data from the file system.
- A series of "transformation" statements to process the data.
- A DUMP statement to view results or a STORE statement to save the results.
LOAD -> TRANSFORM -> DUMP or STORE
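The three stages can be mimicked in plain Python to show the shape of a Pig Latin script. This is only an illustrative sketch, not Pig itself: the sample rows, the threshold and the function names `load`, `transform` and `store` are assumptions made for the example.

```python
# Hypothetical input: one "score,status" record per line.
RAW = "19,2\n18,1\n17,2\n"

def load(text):
    """LOAD: read delimited text into a list of tuples."""
    return [tuple(int(f) for f in line.split(",")) for line in text.splitlines() if line]

def transform(rows):
    """TRANSFORM: keep rows whose first field (score) is above 17."""
    return [row for row in rows if row[0] > 17]

def store(rows):
    """STORE: serialize the result back to delimited text."""
    return "".join(",".join(str(f) for f in row) + "\n" for row in rows)

rows = load(RAW)          # LOAD
kept = transform(rows)    # TRANSFORM
print(kept)               # DUMP: show the result on screen
output = store(kept)      # STORE: keep the result as text
```

A real script has the same outline, with one LOAD at the top, any number of transformation statements in the middle, and a DUMP or STORE at the end.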

6 Running Pig
Script
- Execute commands in a file
- $ pig scriptFile.pig
Grunt
- Interactive shell for executing Pig commands
- Started when a script file is NOT provided

7 Running Modes
Local Mode
- Executes in a single JVM
- Works exclusively with the local file system
- Great for development, experimentation and prototyping
Hadoop Mode
- Also known as MapReduce mode
- Pig renders Pig Latin into MapReduce jobs and executes them on the cluster
- Can execute against a pseudo-distributed or fully distributed Hadoop installation

8 Running Modes
$ pig -x local
$ pig -x mapreduce

9 Hadoop Mode

10 Pig Relation
Pig Latin statements work with relations
- A field is a piece of data: 19
- A tuple is an ordered set of fields: (19,2)
- A bag is an unordered collection of tuples: {(19,2), (18,1)}
- A relation is a bag
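The nesting of field, tuple and bag can be modeled with ordinary Python values. The sketch below is an analogy rather than Pig's actual implementation; using `Counter` as a multiset is a choice made here because a bag is unordered and may hold duplicate tuples.

```python
from collections import Counter

field = 19                      # a field: one piece of data
tup = (19, 2)                   # a tuple: an ordered set of fields

# A bag: an unordered collection of tuples, modeled as a multiset.
bag = Counter([(19, 2), (18, 1)])
same_bag = Counter([(18, 1), (19, 2)])   # same tuples, different order

print(tup[0] == field)          # fields are addressed by position
print(bag == same_bag)          # the order of tuples in a bag does not matter
print((19, 2) != (2, 19))       # the order of fields in a tuple does matter
```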

11 Data Types
Type       Description                                       Example
int        Signed 32-bit integer                             10
long       Signed 64-bit integer                             Data: 10L or 10l; Display: 10L
float      32-bit floating point                             Data: 10.5F or 10.5f or 10.5e2f or 10.5E2F; Display: 10.5F or 1050.0F
double     64-bit floating point                             Data: 10.5 or 10.5e2 or 10.5E2; Display: 10.5 or 1050.0
chararray  Character array (string) in Unicode UTF-8 format  hello world
boolean    Boolean                                           true/false (case insensitive)
datetime   Date-time                                         1970-01-01T00:00:00.000+00:00

12 LOAD operator
Loads the contents of text files into a bag named data, with a schema
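What loading with a schema amounts to can be sketched in Python: split each line on a delimiter, then name and type the fields. The file contents, the comma delimiter and the two-field schema are assumptions for illustration. The slides mention only the `score` and `status` fields and the bag name `data`.

```python
from typing import NamedTuple

# In Pig Latin, a comparable statement (hypothetical file name) could be:
#   data = LOAD 'data.txt' USING PigStorage(',') AS (score:int, status:int);

class Row(NamedTuple):
    """Hypothetical schema with two integer fields."""
    score: int
    status: int

def load(lines, delimiter=","):
    """Split each text line on the delimiter and cast the fields per the schema."""
    data = []
    for line in lines:
        score, status = line.strip().split(delimiter)
        data.append(Row(int(score), int(status)))
    return data

data = load(["19,2", "18,1"])
print(data[0].score, data[0].status)
```

With a schema in place, later statements can refer to fields by name instead of by position.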

13 DUMP and STORE operator
- No action is taken until a DUMP or STORE command is encountered
- Pig will parse, validate and analyze statements but not execute them
- DUMP: display the results on screen
- STORE: save the results to a file
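This deferred behavior can be illustrated with a toy Python class that records steps and runs them only on request. It is a sketch of the idea, not of Pig's planner: the class `Relation`, its methods and the `reads` log are all invented for the example.

```python
class Relation:
    """Records steps instead of running them (a toy stand-in for a plan)."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def filter(self, predicate):
        # Only extends the plan; no row is touched here.
        return Relation(self.source, self.steps + (predicate,))

    def dump(self):
        # Execution happens only now.
        rows = list(self.source())
        for predicate in self.steps:
            rows = [r for r in rows if predicate(r)]
        return rows

reads = []

def source():
    reads.append("read")          # lets us observe when data is actually read
    return [(19, 2), (18, 1), (17, 2)]

high = Relation(source).filter(lambda r: r[0] > 17)
reads_before_dump = len(reads)    # still 0: defining the filter read nothing
result = high.dump()
print(reads_before_dump, len(reads), result)
```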

14 DUMP and STORE operator
DUMP Example STORE Example

15 FILTER and GROUP operator
- Filter the data bag
- Group the bag filtered by score
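The effect of the two operators can be sketched in Python. The rows and the filter condition are assumptions for illustration; the Pig Latin lines in the comments show one plausible way to write the same steps.

```python
from collections import defaultdict

data = [(19, 2), (18, 1), (19, 1), (17, 2)]   # hypothetical (score, status) rows

# FILTER keeps only the rows that satisfy a condition, e.g.
#   filtered = FILTER data BY score >= 18;
filtered = [row for row in data if row[0] >= 18]

# GROUP produces one tuple per distinct key: the key plus the bag of
# all rows sharing it, e.g.
#   grouped = GROUP filtered BY score;
groups = defaultdict(list)
for row in filtered:
    groups[row[0]].append(row)
grouped = sorted(groups.items())   # sorted only to make the printout stable

print(grouped)
```

Note that GROUP does not aggregate anything by itself. It only collects the matching rows into a bag per key.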

16 ORDER operator
Note: For descending order:
Sorted = ORDER data BY score DESC;
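For comparison, the same sort on a small in-memory list in Python. The sample rows are assumed; `score` is taken to be the first field of each tuple.

```python
data = [(18, 1), (19, 2), (17, 2)]   # hypothetical (score, status) rows

# ORDER data BY score;        (ascending is the default)
ascending = sorted(data, key=lambda row: row[0])

# ORDER data BY score DESC;
descending = sorted(data, key=lambda row: row[0], reverse=True)

print(descending)
```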

17 FOREACH operator
For each row, emit the score and status fields
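A Python sketch of this projection follows. The extra `name` field and the sample values are assumptions, added only so that there is something to drop.

```python
data = [("kim", 19, 2), ("lee", 18, 1)]   # hypothetical (name, score, status) rows

# In Pig Latin this could be written as:
#   result = FOREACH data GENERATE score, status;
result = [(score, status) for _name, score, status in data]

print(result)
```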

18 DISTINCT operator
Remove duplicate tuples in a bag
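In Python terms, DISTINCT behaves like turning the bag into a set of whole tuples. The sample rows below are assumed.

```python
data = [(19, 2), (18, 1), (19, 2)]   # hypothetical rows with one duplicate

# In Pig Latin this could be written as:
#   unique = DISTINCT data;
unique = sorted(set(data))           # sorted only to make the printout stable

print(unique)
```

Duplicates are detected on the entire tuple, so (19, 2) and (19, 1) would both be kept.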

19 UNION operator
Merge the contents of two or more bags
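A minimal Python sketch of the merge, with assumed sample rows. It also shows that UNION keeps duplicates: a tuple present in both inputs appears twice in the result.

```python
from collections import Counter

data1 = [(19, 2), (18, 1)]           # hypothetical rows
data2 = [(18, 1), (17, 2)]

# In Pig Latin this could be written as:
#   merged = UNION data1, data2;
merged = data1 + data2

counts = Counter(merged)
print(len(merged), counts[(18, 1)])
```

Applying DISTINCT to the result would remove the repeated tuple.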

20 JOIN operator
Bags data1 and data2 are joined on their first fields.
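The inner join described here can be sketched as a nested loop in Python. The contents of `data1` and `data2` are assumed for illustration; each output tuple carries the fields of both matching inputs.

```python
data1 = [(19, 2), (18, 1)]                  # hypothetical rows
data2 = [(19, "pass"), (17, "fail")]        # hypothetical rows

# In Pig Latin, joining on the first field of each bag could be written as:
#   joined = JOIN data1 BY $0, data2 BY $0;
joined = [left + right
          for left in data1
          for right in data2
          if left[0] == right[0]]

print(joined)
```

Rows without a partner on the other side, here (18, 1) and (17, "fail"), do not appear in the result.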

21 SUM, MIN, AVG Functions
Note:
- find min value: MIN
- find sum value: SUM
- find average value: AVG
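These functions operate on a bag of values, so in Pig they are typically applied after a GROUP. A Python sketch with assumed scores follows; the Pig Latin in the comments assumes a relation `data` with a `score` field.

```python
scores = [19, 18, 17]     # hypothetical values of the score field

# In Pig Latin this could be written as:
#   grouped = GROUP data ALL;
#   stats = FOREACH grouped GENERATE SUM(data.score), MIN(data.score), AVG(data.score);
total = sum(scores)
lowest = min(scores)
average = total / len(scores)

print(total, lowest, average)
```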

