1
Pig Installation Guide and Practical Example
Presented by Priagung Khusumanegara
Prof. Kyungbaek Kim
2
Installation Guide
Requirements
Java 1.6 (this example uses java-7-openjdk)
Hadoop 0.23.x, 1.2.x, or 2.5.x (this example uses Hadoop 1.2.1)
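Before installing Pig, the prerequisites can be verified from a terminal (a quick check, assuming Java and Hadoop are already on your PATH; exact output depends on your setup):
$ java -version
$ hadoop version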
3
Configuration
Make sure you have installed Hadoop and can run Hadoop correctly.
Download the Pig stable version (0.13):
$ wget http://apache.tt.co.kr/pig/pig-0.13.0/pig-0.13.0.tar.gz
Unpack the downloaded Pig distribution and move it to the preferred directory (this example uses /usr/local/pig):
$ tar -xvzf pig-0.13.0.tar.gz
$ mv pig-0.13.0 /usr/local/pig
Edit ~/.bashrc and add the following lines at the end:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
Test the Pig installation with a simple command:
$ pig -help
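As an optional sanity check (a minimal sketch, assuming ~/.bashrc has been re-read), Pig can be started in local mode and exited again:
$ source ~/.bashrc
$ pig -x local
grunt> quit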
4
Practical Example
Objective: count the packet length between source IP and destination IP addresses in the network traffic.
Run Hadoop.
Download the input file and copy it to HDFS:
$ wget https://www.dropbox.com/s/k6li67bha12geet/input.txt?dl=1 -O input.txt
$ hadoop dfs -copyFromLocal input.txt /input/input.txt
Note: the input file was captured with tcpdump: tcpdump -n -i wlan0 >> input.txt
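To confirm the copy succeeded (assuming the /input directory exists in HDFS), list and preview the file:
$ hadoop dfs -ls /input
$ hadoop dfs -cat /input/input.txt | head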
5
Screenshot Input File (input.txt)
Enter grunt:
$ pig -x mapreduce
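The screenshot is not reproduced here. For reference, each line of input.txt is plain tcpdump text output, roughly like the illustrative (made-up) line below; the two IP addresses and the trailing length field are what the script on the next slide extracts:
12:05:33.062794 IP 192.168.0.10.54321 > 172.217.25.14.443: Flags [.], ack 1, win 501, length 1448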
6
Load the text file into a bag, putting the entire line into a single field 'line' of type chararray:
RAW_LOGS = LOAD '/input/input.txt' AS (line:chararray);
Apply a schema to the raw data:
LOGS_BASE = FOREACH RAW_LOGS GENERATE FLATTEN((tuple(CHARARRAY,CHARARRAY,LONG))REGEX_EXTRACT_ALL(line, '.+\\s(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).+\\s(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).+length\\s+(\\d+)')) AS (IPS:chararray, IPD:chararray, S:long);
Group the traffic information by source IP and destination IP addresses:
FLOW = GROUP LOGS_BASE BY (IPS, IPD);
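Before computing the totals, the intermediate schemas can be inspected in grunt (an optional check, not part of the original slides):
grunt> DESCRIBE LOGS_BASE;
grunt> DESCRIBE FLOW;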
7
Count the total packet length for each source/destination IP pair:
TRAFFIC = FOREACH FLOW {
    sorted = ORDER LOGS_BASE BY S DESC;
    GENERATE group, SUM(LOGS_BASE.S);
}
Store the output data in HDFS (/output):
STORE TRAFFIC INTO '/output';
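After the job finishes, the result can be read back from HDFS; the part file name below is the usual MapReduce default and may differ on your cluster:
$ hadoop dfs -ls /output
$ hadoop dfs -cat /output/part-r-00000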
8
Screenshot of Each Process
11
Screenshot Output File