Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.


1 Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel

2 What is Apache Pig? A platform for analyzing large data sets. Supports merging data sets, filtering them, and applying functions to records or groups of records. Allows you to create user-defined functions (UDFs).

3 Pig Infrastructure Mainly consists of two layers: a compiler that produces sequences of Map-Reduce programs, and a language layer that currently consists of a textual language called Pig Latin.

4 Nested Data Model Pig Latin has a fully-nestable data model with: ◦ Atomic values, tuples, bags (lists), and maps. More natural to programmers than flat tuples. Avoids expensive joins. [Figure labels: Computers, Desktops, Laptops, Netbooks]
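A rough illustration of the nested data model, using Python stand-ins rather than Pig itself: a Pig tuple is modeled as a Python tuple, a bag as a list of tuples, and a map as a dict. The catalog data and the `nest` helper are hypothetical, loosely based on the figure labels on this slide.

```python
# Hypothetical catalog rows in flat (relational) form:
# one row per (category, subcategory) pair.
flat_rows = [
    ("Computers", "Desktops"),
    ("Computers", "Laptops"),
    ("Computers", "Netbooks"),
]


def nest(rows):
    """Fold flat (parent, child) rows into (parent, bag_of_children) tuples."""
    bags = {}
    for parent, child in rows:
        # Each bag element is itself a one-field tuple, as in Pig.
        bags.setdefault(parent, []).append((child,))
    return [(parent, bag) for parent, bag in bags.items()]


# One nested tuple replaces three flat rows, so reading a category together
# with its subcategories needs no join at query time.
nested = nest(flat_rows)
print(nested)
```

With the nested form, the subcategories travel inside the parent record, which is the sense in which nesting can avoid a join.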

5 Pig Latin vs. SQL SQL: little control over execution method; query optimization is hard; parallel environment; little or no statistics; lots of UDFs. Pig Latin: ease of programming; optimization opportunities; extensibility.

6 JOIN vs. COGROUP
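The slide gives only the title, so here is a small sketch of the difference, simulated in plain Python rather than Pig. COGROUP keeps the matching tuples from each input in separate bags per key, while an inner JOIN flattens those bags into their cross product. The `cogroup` and `join` helpers and the sample relations are hypothetical; keys are sorted only to make the printed output deterministic.

```python
def cogroup(left, right, key=0):
    """Return (key, left_bag, right_bag) for every key seen in either input."""
    keys = sorted({t[key] for t in left} | {t[key] for t in right})
    return [
        (k,
         [t for t in left if t[key] == k],
         [t for t in right if t[key] == k])
        for k in keys
    ]


def join(left, right, key=0):
    """Inner join expressed as COGROUP followed by flattening both bags."""
    return [
        l + r
        for _, left_bag, right_bag in cogroup(left, right, key)
        for l in left_bag
        for r in right_bag
    ]


# Hypothetical relations: (ip, url) visits and (ip, label) annotations.
visits = [("1.2.3.4", "/a"), ("1.2.3.4", "/b"), ("5.6.7.8", "/a")]
labels = [("1.2.3.4", "bot")]

grouped = cogroup(visits, labels)
joined = join(visits, labels)
print(grouped)
print(joined)
```

Note that the COGROUP result still contains the key `5.6.7.8` with an empty right-hand bag, whereas the inner JOIN drops it.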

7 Using Pig on cloud Pig Latin programs run in a distributed fashion on a cluster. Programs are compiled into Map/Reduce jobs and executed using Hadoop. Pig Latin programs can also run in "local mode" without a cluster.

8 Data Flow Load Visits -> Group by url -> Foreach url generate count; Load Url Info; Join on url -> Group by category -> Foreach category generate top10 urls
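The data flow on this slide can be sketched in plain Python to show what each step computes. This is an in-memory approximation, not Pig Latin, and the sample `visits` and `url_info` records and the function name are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: (user, url) visits and (url, category) metadata.
visits = [
    ("alice", "cnn.com"),
    ("bob", "cnn.com"),
    ("carol", "espn.com"),
    ("alice", "bbc.co.uk"),
]
url_info = [("cnn.com", "news"), ("espn.com", "sports"), ("bbc.co.uk", "news")]


def top_urls_by_category(visits, url_info, n=10):
    # Load Visits -> Group by url -> Foreach url generate count
    counts = Counter(url for _, url in visits)

    # Load Url Info -> Join on url (inner join: unknown urls are dropped)
    category_of = dict(url_info)
    by_category = defaultdict(list)
    for url, count in counts.items():
        if url in category_of:
            # Group by category
            by_category[category_of[url]].append((url, count))

    # Foreach category generate the top n urls by visit count
    # (ties broken alphabetically so the result is deterministic).
    return {
        category: sorted(pairs, key=lambda p: (-p[1], p[0]))[:n]
        for category, pairs in by_category.items()
    }


result = top_urls_by_category(visits, url_info)
print(result)
```

Each grouping or join step in the function corresponds to a point where a distributed engine would have to shuffle records by key.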

9 Map-Reduce on Data Flow Load Visits -> Group by url -> Foreach url generate count; Load Url Info; Join on url -> Group by category -> Foreach category generate top10 urls. The flow runs as Map 1, Reduce 1, Map 2, Reduce 2, Map 3, Reduce 3: every group or join operation forms a map-reduce boundary.

10 Implementation [Diagram labels: user, SQL, Pig, automatic rewrite + optimize, Hadoop Map-Reduce, cluster]

11 IP Filtering • Internet companies are swimming in data • Huge volumes of data must be analyzed to filter out bot IPs • A high-level language in a cloud environment would be useful for filtering out these IPs efficiently
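To make the filtering task concrete, here is a minimal single-machine sketch in Python. The slides do not say how a bot IP is identified, so the rule used here (an IP that sends more than `max_requests` requests) is purely an assumption, as are the log format, the function names, and the sample data.

```python
from collections import Counter


def find_bot_ips(log_lines, max_requests=3):
    """Flag IPs with more than max_requests requests (assumed heuristic)."""
    # Assumes the IP address is the first whitespace-separated field.
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return {ip for ip, n in counts.items() if n > max_requests}


def filter_log(log_lines, bot_ips):
    """Keep only the log lines whose source IP was not flagged."""
    return [
        line for line in log_lines
        if line.strip() and line.split()[0] not in bot_ips
    ]


# Hypothetical access log: one chatty client and one ordinary client.
log = ["10.0.0.1 GET /page%d" % i for i in range(5)]
log.append("10.0.0.2 GET /index.html")

bots = find_bot_ips(log)
clean = filter_log(log, bots)
print(bots)
print(clean)
```

The counting step is a group-by on IP followed by a count, and the filtering step is a join against the flagged set, which is the shape of computation that Pig or Hadoop would distribute over a cluster.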

12 Objective • Understand how Pig Latin works • Implement an IP address filter using Apache Pig • Implement a similar IP filter using purely Hadoop • Compare and analyze the two implementations • Conduct a case study of the pros and cons of other high-level languages compared with Pig

13 Time Line
Milestone | Schedule
Understand how Pig Latin works; read through the tutorial | 11/07/2011
Implement IP filter using Apache Pig and perform analysis to figure out best scenarios for specific optimizations | 11/14/2011
Implement IP filter using purely Hadoop and compare it to the Pig implementation | 11/28/2011
Conduct a case study on the pros and cons of high-level languages | 12/05/2011
Final Report | 12/12/2011

14 References • A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the ACM SIGMOD 2008 International Conference on Management of Data (Auckland, New Zealand, June 2008) • https://cwiki.apache.org/confluence/display/PIG/Index • http://pig.apache.org/

15 Thank you

