Published by Eustacia Daniels. Modified over 9 years ago.
1
Large-Scale IP Filtering Using Apache Pig, and a Case Study. Kaushik Chandrasekaran, Nabeel Akheel
2
What is Apache Pig? A platform for analyzing large data sets. It supports merging data sets, filtering them, and applying functions to records or groups of records. It also allows you to create user-defined functions (UDFs).
3
Pig Infrastructure: Pig mainly consists of two layers. The infrastructure layer is a compiler that produces sequences of Map-Reduce programs. The language layer currently consists of a textual language called Pig Latin.
4
Nested Data Model: Pig Latin has a fully nestable data model with atomic values, tuples, bags (lists), and maps. This is more natural to programmers than flat tuples and avoids expensive joins. Example (from the slide figure): Computers, containing Desktops, Laptops, and Netbooks.
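The nested model can be pictured with ordinary Python values. This is only an illustrative sketch, not Pig itself: the `category` record reuses the Computers example from the slide, while the `source` map entry and the `flatten` helper are assumptions added for illustration.

```python
# A rough Python analogy for Pig Latin's data model (not Pig's actual types):
#   atom  -> a plain scalar such as a string or number
#   tuple -> a fixed sequence of fields
#   bag   -> a collection of tuples (modelled here as a list)
#   map   -> a dictionary from keys to values

category = (
    "Computers",                                    # atom
    [("Desktops",), ("Laptops",), ("Netbooks",)],   # bag of one-field tuples
    {"source": "slide example"},                    # map (hypothetical content)
)


def flatten(record):
    """Un-nest the bag, producing one flat (parent, child) row per element."""
    name, bag, _props = record
    return [(name, child) for (child,) in bag]


flat_rows = flatten(category)
```

Keeping the sub-categories nested inside the parent record is what lets a program avoid a join between a parent table and a child table; `flatten` shows the flat form that a purely relational layout would store instead.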
5
Pig Latin vs. SQL. SQL: little control over the execution method; query optimization is hard (parallel environment, little or no statistics, lots of UDFs). Pig Latin: ease of programming, optimization opportunities, extensibility.
6
JOIN vs. COGROUP
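The difference between the two operators can be sketched in plain Python. This is a conceptual model rather than Pig's implementation: COGROUP collects the matching tuples from each input into a pair of bags per key, and an inner JOIN is equivalent to taking the cross product of those bags. The `visits` and `url_info` rows and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def cogroup(left, right, key_left, key_right):
    """Group both inputs by key; each key maps to (left_bag, right_bag)."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key_left]][0].append(row)
    for row in right:
        groups[row[key_right]][1].append(row)
    return dict(groups)


def join(left, right, key_left, key_right):
    """Inner join expressed as COGROUP followed by a per-key cross product."""
    joined = []
    for lbag, rbag in cogroup(left, right, key_left, key_right).values():
        # An empty bag on either side yields no output rows for that key.
        joined.extend(l + r for l, r in product(lbag, rbag))
    return joined


# Hypothetical sample data: (user, url) and (url, category).
visits = [("amy", "cnn.com"), ("bob", "cnn.com"), ("amy", "bbc.com")]
url_info = [("cnn.com", "news"), ("espn.com", "sports")]

grouped = cogroup(visits, url_info, key_left=1, key_right=0)
joined = join(visits, url_info, key_left=1, key_right=0)
```

Note that `grouped` keeps keys that appear on only one side (with an empty bag for the other), whereas `joined` drops them. That retained nesting is what allows later steps to process each group without first flattening it.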
7
Using Pig on the cloud: Pig Latin programs run in a distributed fashion on a cluster. Programs are compiled into Map/Reduce jobs and executed using Hadoop. Pig Latin programs can also run in "local mode" without a cluster.
8
Data Flow: Load Visits → Group by url → Foreach url generate count. Load Url Info. Join the two on url → Group by category → Foreach category generate top10 urls.
9
Map-Reduce on Data Flow: Load Visits → Group by url → Foreach url generate count. Load Url Info. Join the two on url → Group by category → Foreach category generate top10 urls. The flow is split across three jobs: Map 1 / Reduce 1, Map 2 / Reduce 2, Map 3 / Reduce 3. Every group or join operation forms a map-reduce boundary.
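The data flow above can be traced step by step in a single Python process. This is a sketch of the logic only, not of how Pig or Hadoop executes it; the sample rows are invented, and it assumes each url appears at most once in `url_info`. The comments mark the three group/join steps that the slide identifies as map-reduce boundaries.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: visits are (user, url); url_info is (url, category).
visits = [
    ("amy", "cnn.com"), ("bob", "cnn.com"), ("amy", "bbc.com"),
    ("cat", "espn.com"), ("dan", "unknown.org"),
]
url_info = [("cnn.com", "news"), ("bbc.com", "news"), ("espn.com", "sports")]

# Boundary 1: group by url, then for each url generate a count.
counts = Counter(url for _user, url in visits)

# Boundary 2: join the counts with url_info on url (inner join).
category_of = dict(url_info)  # assumes one category per url
joined = [
    (url, category_of[url], n)
    for url, n in counts.items()
    if url in category_of
]

# Boundary 3: group by category, then for each category keep the top 10 urls.
by_category = defaultdict(list)
for url, category, n in joined:
    by_category[category].append((url, n))

top10 = {
    category: sorted(rows, key=lambda r: (-r[1], r[0]))[:10]
    for category, rows in by_category.items()
}
```

Each of the three marked steps needs all rows with the same key brought together, which is why each one requires a shuffle and therefore its own reduce phase in a Map-Reduce execution.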
10
Implementation [diagram: user, SQL, Pig (automatic rewrite + optimize), Hadoop Map-Reduce, cluster]
11
IP Filtering: Internet companies are swimming in data. Huge volumes of data must be analyzed to filter out bot IPs. A high-level language in a cloud environment would be useful for filtering out these IPs efficiently.
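To make the filtering task concrete, here is a minimal sketch of one possible filter. The slides do not say how a bot IP is identified, so the rule used here (an IP is treated as a bot when its request count exceeds a threshold) is purely an assumption, as are the `(ip, url)` record layout, the `max_requests` parameter, and the function name.

```python
from collections import Counter


def filter_bot_ips(requests, max_requests):
    """Split requests into suspected bot IPs and the remaining clean rows.

    Assumption: an IP is a suspected bot if it issued more than
    `max_requests` requests. A real filter would likely use richer signals.
    """
    hits = Counter(ip for ip, _url in requests)           # group by ip, count
    bots = {ip for ip, n in hits.items() if n > max_requests}
    clean = [(ip, url) for ip, url in requests if ip not in bots]
    return bots, clean


# Hypothetical access-log rows: (ip, url).
requests = [
    ("10.0.0.1", "/a"), ("10.0.0.1", "/b"), ("10.0.0.1", "/c"),
    ("10.0.0.2", "/a"),
    ("10.0.0.3", "/a"), ("10.0.0.3", "/b"),
]

bots, clean = filter_bot_ips(requests, max_requests=2)
```

The same shape (group by IP, count, filter, then drop the matching rows) maps naturally onto a grouping step followed by a filter in a data-flow language, which is the kind of job the project sets out to compare between Pig and hand-written Hadoop code.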
12
Objective: Understand how Pig Latin works. Implement an IP address filter using Apache Pig. Implement a similar IP filter using pure Hadoop. Compare and analyze the two implementations. Conduct a case study of the pros and cons of other high-level languages compared with Pig.
13
Timeline
Milestone | Schedule
Understand how Pig Latin works; read through the tutorial | 11/07/2011
Implement IP filter using Apache Pig and perform analysis to figure out the best scenarios for specific optimizations | 11/14/2011
Implement IP filter using pure Hadoop and compare it to the Pig implementation | 11/28/2011
Conduct a case study on the pros and cons of high-level languages | 12/05/2011
Final report | 12/12/2011
14
References
A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD 2008 International Conference on Management of Data (Vancouver, Canada, June 2008)
https://cwiki.apache.org/confluence/display/PIG/Index
http://pig.apache.org/
15
Thank you