Published by Giles Rogers. Modified over 8 years ago.
1
Large scale IP filtering using Apache Pig and case study
Kaushik Chandrasekaran, Nabeel Akheel
2
Objective
Understand how Pig Latin works
Implement an IP address filter using Apache Pig
Implement a similar IP filter using pure Hadoop
Compare and analyze the two implementations
Conduct a case study of the pros and cons of other high-level languages compared with Pig
3
What is Apache Pig?
A platform for analyzing large data sets
Supports merging data sets, filtering them, and applying functions to records or groups of records
Allows you to create user-defined functions (UDFs)
4
Pig Infrastructure
Consists mainly of two layers:
A compiler that produces sequences of Map-Reduce programs
A language layer, which currently consists of a textual language called Pig Latin
5
Nested Data Model
Pig Latin has a fully nestable data model with atomic values, tuples, bags (collections of tuples), and maps
More natural to programmers than flat tuples
Avoids expensive joins
Figure labels: Computers; Desktops, Laptops, Netbooks
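As a rough illustration of the nested data model, the sketch below mimics the four kinds of values with built-in Python types. The mapping (a Python tuple for a tuple, a list of tuples for a bag, a dict for a map) and the sample counts are assumptions made for illustration, not Pig's internal representation.

```python
# Hypothetical sample data; names and numbers are illustrative only.

# Atom: a single simple value.
category = "Computers"

# Tuple: an ordered sequence of fields.
product = ("Laptops", 3)

# Bag: a collection of tuples (duplicates allowed, order not significant).
# A plain list stands in for the bag here.
products = [("Desktops", 5), ("Laptops", 3), ("Netbooks", 2)]

# Map: keys mapped to values, which may themselves be nested.
record = {"category": category, "products": products}

# One nested record holds a category together with all of its products,
# so reading them together needs no join between two flat tables.
total = sum(count for _, count in record["products"])
names = sorted(name for name, _ in record["products"])
print(total, names)
```

Keeping the products inside the category record is the nesting that lets a query avoid a join between a flat category table and a flat product table.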
6
Pig Latin vs. SQL
SQL:
Little control over execution method
Query optimization is hard
Parallel environment
Little or no statistics
Lots of UDFs
Pig Latin:
Ease of programming
Optimization opportunities
Extensibility
7
JOIN vs. COGROUP
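The difference between the two operators can be sketched in plain Python. COGROUP collects the matching tuples of each input into separate bags per key, while an inner JOIN flattens those bags into their cross product. The functions `cogroup` and `join` and the sample relations below are hypothetical stand-ins written for this illustration, not Pig APIs.

```python
from collections import defaultdict


def cogroup(left, right, key=0):
    """Group both inputs by key; each key gets one bag per input."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return dict(groups)


def join(left, right, key=0):
    """Inner equi-join: cross product of the two bags of every cogroup."""
    return [
        l + r
        for lbag, rbag in cogroup(left, right, key).values()
        for l in lbag
        for r in rbag
    ]


# Illustrative data: (url, user) visits and (url, category) info.
visits = [("a.com", "u1"), ("a.com", "u2"), ("b.com", "u1")]
info = [("a.com", "news"), ("c.com", "sports")]

grouped = cogroup(visits, info)
joined = join(visits, info)
print(grouped)
print(joined)
```

In this sketch the cogroup result keeps keys that appear in only one input (with an empty bag for the other), whereas the join drops them.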
8
Using Pig on the cloud
Pig Latin programs run in a distributed fashion on a cluster
Programs are compiled into Map/Reduce jobs and executed using Hadoop
Pig Latin programs can also run in "local mode" without a cluster
9
Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
10
Map-Reduce on Data Flow
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10 urls
Stage labels in the figure: Map 1, Reduce 1, Map 2, Reduce 2, Map 3, Reduce 3
Every group or join operation forms a map-reduce boundary
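The visits data flow can be traced locally with a small Python sketch in which each group or join step is marked as one boundary. The sample visits, the url-to-category table, and the `TOP_N` name are assumptions for illustration; a real run would execute these steps as Map-Reduce jobs on Hadoop rather than in one process.

```python
from collections import Counter, defaultdict

# Illustrative inputs: one url per visit, and a url -> category lookup.
visits = ["a.com", "b.com", "a.com", "c.com", "a.com", "b.com"]
url_info = {"a.com": "news", "b.com": "news", "c.com": "sports"}

# Boundary 1: group visits by url and count each group.
counts = Counter(visits)

# Boundary 2: join the per-url counts with the url info on url.
joined = [(url, n, url_info[url]) for url, n in counts.items() if url in url_info]

# Boundary 3: group by category, then keep the top-N urls of each group.
TOP_N = 10
by_category = defaultdict(list)
for url, n, cat in joined:
    by_category[cat].append((url, n))

top = {
    cat: sorted(rows, key=lambda r: r[1], reverse=True)[:TOP_N]
    for cat, rows in by_category.items()
}
print(top)
```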
11
Implementation
Figure labels: user; SQL; Pig; automatic rewrite + optimize; Hadoop Map-Reduce; cluster
12
IP Filtering
Internet companies are swimming in data
Analysis of huge data sets is needed to filter out bot IPs
A high-level language in a cloud environment would be useful for filtering out these IPs efficiently
13
Log Used
Two months' worth of all HTTP requests to the NASA Kennedy Space Center
14
Data Flow
Load Logs
Group by ip
Foreach ip generate count
Load Log 2
Join on ip
Generate the top IPs
ORDER BY count
Filter ip based on threshold
15
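EXAMPLE
A = LOAD 'input/*' USING PigStorage('\t') AS (ip:chararray);  -- read the first tab-separated field as the IP
B = GROUP A BY ip;  -- one group per distinct IP
C = FOREACH B GENERATE FLATTEN(group), COUNT(A.ip) AS count;  -- requests per IP
D = ORDER C BY count;  -- ascending by request count
E = FILTER D BY $1 > 500;  -- keep IPs above the threshold
STORE E INTO 'result';  -- STORE is a statement and is not assigned to an alias
Lines of Code: 6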
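For checking the logic of the script on a small sample without a cluster, the same group, count, order, and filter steps can be mirrored in Python. This is only a local sketch: `filter_ips`, the sample addresses, and the lowered threshold are hypothetical, and it says nothing about how Pig executes the script.

```python
from collections import Counter


def filter_ips(ips, threshold=500):
    """Count requests per IP, order by count, keep IPs above the threshold."""
    counts = Counter(ips)                                    # GROUP ... BY ip + COUNT
    ordered = sorted(counts.items(), key=lambda kv: kv[1])   # ORDER ... BY count (ascending)
    return [(ip, n) for ip, n in ordered if n > threshold]   # FILTER ... BY $1 > threshold


# Illustrative input: one IP string per request.
sample = ["10.0.0.1"] * 3 + ["10.0.0.2"] * 1 + ["10.0.0.3"] * 5
result = filter_ips(sample, threshold=2)
print(result)
```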
16
IP Filtering - Pure Map Reduce
Figure labels: Map, Reduce
Filter IPs from log files
Compute occurrence of IPs
Sort IPs based on count
Compute cumulative frequency
Filter IPs above threshold
Lines of Code: 130
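The steps listed for the pure Map-Reduce version can be sketched as separate map, shuffle, and reduce functions in Python. Several details here are assumptions, since the slide does not give them: the log line layout (client address as the first whitespace-separated field), how the cumulative frequency is used (here it is only reported alongside each IP), and all function names. It is a single-process simulation, not Hadoop code.

```python
from collections import defaultdict
from itertools import accumulate


def map_phase(lines):
    """Emit (ip, 1) per log line; assumes the IP is the first field."""
    for line in lines:
        fields = line.split()
        if fields:
            yield fields[0], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the ones for each IP to get its occurrence count."""
    return {ip: sum(values) for ip, values in grouped.items()}


def filter_above(counts, threshold):
    """Sort by count (descending), attach cumulative frequency, filter by count."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(counts.values())
    running = accumulate(n for _, n in ordered)
    return [(ip, n, cum / total)
            for (ip, n), cum in zip(ordered, running)
            if n > threshold]


# Hypothetical log lines in a simplified layout.
log = [
    "1.1.1.1 - - GET /a",
    "2.2.2.2 - - GET /b",
    "1.1.1.1 - - GET /c",
    "1.1.1.1 - - GET /d",
    "3.3.3.3 - - GET /e",
    "2.2.2.2 - - GET /f",
]
counts = reduce_phase(shuffle(map_phase(log)))
flagged = filter_above(counts, threshold=1)
print(flagged)
```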
17
PERFORMANCE ANALYSIS
23
DEMO
24
Pig Vs Hive
25
PROS AND CONS
PROS
Allows UDFs
Scales easily to large data
Simple, easy-to-understand language
CONS
Does not allow JDBC/ODBC
No server
26
Time Line
Milestone: Schedule
Understand how Pig Latin works; read through the tutorial: 11/07/2011
Implement IP filter using Apache Pig and perform analysis to figure out best scenarios for specific optimizations: 11/14/2011
Implement IP filter using purely Hadoop and compare it to the Pig implementation: 11/28/2011
Conduct a case study on the pros and cons of high-level languages: 12/05/2011
Final Report: 12/12/2011
27
References
Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD 2008 International Conference on Management of Data (Vancouver, Canada, June 2008)
https://cwiki.apache.org/confluence/display/PIG/Index
http://pig.apache.org/
28
Thank you