Pig Installation Guide and Practical Example
Presented by Priagung Khusumanegara, Prof. Kyungbaek Kim
Installation Guide
Requirements:
Java 1.6 or later (this example uses java-7-openjdk)
Hadoop 0.23.x, 1.2.x, or 2.5.x (this example uses Hadoop 1.2.1)
Configuration
Make sure Hadoop is installed and runs correctly.
Download the Pig stable version (0.13):
$ wget
Unpack the downloaded Pig distribution and move it to the preferred directory (this example uses /usr/local/pig):
$ tar -xvzf pig-0.13.0.tar.gz
$ mv pig-0.13.0 /usr/local/pig
Edit ~/.bashrc and add the following statements at the end:
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
Test the Pig installation with a simple command:
$ pig -help
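For reference, the whole download-and-configure sequence might look like the sketch below; the Apache archive URL and exact tarball name are assumptions based on the 0.13.0 release, so substitute whatever mirror the Apache download page points you to.
$ # assumed mirror path for Pig 0.13.0; replace with the URL from the Apache download page
$ wget https://archive.apache.org/dist/pig/pig-0.13.0/pig-0.13.0.tar.gz
$ tar -xvzf pig-0.13.0.tar.gz
$ sudo mv pig-0.13.0 /usr/local/pig
$ echo 'export PIG_HOME=/usr/local/pig' >> ~/.bashrc
$ echo 'export PATH=$PATH:$PIG_HOME/bin' >> ~/.bashrc
$ source ~/.bashrc
$ pig -help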
Practical Example
Objective: count the total packet length exchanged between each source IP and destination IP in the network traffic.
Running Hadoop
Download the input file and copy it to HDFS:
$ wget -O input.txt https://
$ hadoop dfs -copyFromLocal input.txt /input/input.txt
Note: the input file can be generated with tcpdump:
$ tcpdump -n -i wlan0 >> input.txt
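If the /input directory does not exist in HDFS yet, create it before copying; a minimal sketch using the Hadoop 1.x "hadoop dfs" commands from the slide:
$ hadoop dfs -mkdir /input                              # create the target directory in HDFS (skip if it already exists)
$ hadoop dfs -copyFromLocal input.txt /input/input.txt  # upload the captured traffic file
$ hadoop dfs -ls /input                                 # verify the file arrived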
Screenshot: Input File (input.txt)
Enter the Grunt shell in MapReduce mode:
$ pig -x mapreduce
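Each line of input.txt is a plain tcpdump record. A hypothetical line in the expected shape (the addresses, ports, and length below are made up for illustration) looks like:
13:45:12.345678 IP 192.168.0.10.51234 > 10.0.0.5.80: Flags [P.], seq 1:101, ack 1, win 229, length 100
The Pig script that follows extracts the two IP addresses and the trailing length value from lines of this shape.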
Load the text file into a bag, placing each entire line in a field 'line' of type chararray:
RAW_LOGS = LOAD '/input/input.txt' AS (line:chararray);
Apply a schema to the raw data, extracting the source IP, destination IP, and packet length:
LOGS_BASE = FOREACH RAW_LOGS GENERATE FLATTEN(
    (tuple(chararray,chararray,long))
    REGEX_EXTRACT_ALL(line, '.+\\s(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).+\\s(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).+length\\s+(\\d+)')
) AS (IPS:chararray, IPD:chararray, S:long);
Group the traffic records by (source IP, destination IP) pair:
FLOW = GROUP LOGS_BASE BY (IPS, IPD);
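Inside the Grunt shell, the intermediate relations can be inspected before running the full job; DESCRIBE, LIMIT, and DUMP are standard Pig commands for this (the limit of 5 below is an arbitrary choice for a quick look):
grunt> DESCRIBE LOGS_BASE;                -- show the schema applied by the FOREACH ... GENERATE
grunt> SAMPLE_ROWS = LIMIT LOGS_BASE 5;   -- keep only a few tuples
grunt> DUMP SAMPLE_ROWS;                  -- runs a small job and prints the extracted tuples
grunt> DESCRIBE FLOW;                     -- group key (IPS, IPD) plus a bag of matching LOGS_BASE tuples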
Sum the packet lengths for each (source IP, destination IP) pair:
TRAFFIC = FOREACH FLOW {
    sorted = ORDER LOGS_BASE BY S DESC;
    GENERATE group, SUM(sorted.S);
}
Store the output data in HDFS (/output):
STORE TRAFFIC INTO '/output';
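Once the STORE job finishes, the result can be read back either from inside Grunt or from the ordinary shell; a quick check, assuming the default part-file naming of a MapReduce job:
grunt> fs -cat /output/part*        -- print the stored ((IPS, IPD), total length) records from Grunt
Or, from the command line:
$ hadoop dfs -cat /output/part*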
Screenshot: Each Process
Screenshot: Output File