Created By: Dan Robert and Ronald Richardson II


1 Crawdaddy
Created By: Dan Robert and Ronald Richardson II

2 Motivation
Experience: Hadoop, HBase, Jsoup
MapReduce program
Distributed computing environment: efficiency, speed, reliability

3 Project Idea
Crawdaddy! A distributed computing web crawler.
Web crawlers: URLs -> more URLs -> more URLs!
Search engines

4 Dataset
Arbitrary data set
Initial HBase table: small, ~50 URLs
Next iteration: larger HBase table
Repeat over and over: ~2 billion websites
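The repeated iterations described on this slide can be sketched in plain Java. This is only an in-memory illustration, not the authors' Hadoop job: the class `CrawlRounds`, the method `crawl`, and the `web` link graph are hypothetical names, and a `Set` stands in for the HBase table.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for the repeated crawl rounds; no Hadoop or HBase involved. */
final class CrawlRounds {

    /**
     * Runs up to `rounds` crawl iterations. Each round looks up the outgoing
     * links of every URL currently in the table and adds them, so the table
     * can only grow. `web` is a hypothetical link graph that replaces real
     * page fetches.
     */
    static Set<String> crawl(Set<String> seeds, Map<String, List<String>> web, int rounds) {
        Set<String> table = new LinkedHashSet<>(seeds); // plays the role of the HBase table
        for (int i = 0; i < rounds; i++) {
            Set<String> next = new LinkedHashSet<>(table);
            for (String url : table) {
                next.addAll(web.getOrDefault(url, List.of()));
            }
            if (next.size() == table.size()) {
                break; // nothing new was discovered, so further rounds are pointless
            }
            table = next;
        }
        return table;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d"),
            "c", List.of("a", "d"));
        System.out.println(crawl(Set.of("a"), web, 1)); // prints [a, b, c]
        System.out.println(crawl(Set.of("a"), web, 2)); // prints [a, b, c, d]
    }
}
```

The output of one round becomes the input of the next, which mirrors the "larger HBase table" step above.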

5 Components
Input HBase table
Mapper
Reducer
Driver
Output HBase table

6 Input HBase Table
Initial data
Small: ~50 URLs

7 Mapper
Input: URLs in HBase table
Webpages will be retrieved using Jsoup
Output: Text/BytesWritable (URL/Webpage)
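The shape of this map step can be illustrated without Hadoop on the classpath. The sketch below is an assumption-laden stand-in, not the presenters' code: `FetchMapperSketch`, `map`, and the `fetcher` hook are hypothetical, a `String` key replaces `Text`, and a `byte[]` value replaces `BytesWritable`. In a real job the fetch would presumably be done with Jsoup (for example `Jsoup.connect(url).get()`) inside an HBase `TableMapper`.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java stand-in for the map step: URL in, (URL, page bytes) out. */
final class FetchMapperSketch {

    /**
     * Emits one (URL, page bytes) pair per URL that can be fetched.
     * `fetcher` is a hypothetical hook that returns the page source for a URL.
     * URLs whose fetch throws are skipped so one dead link does not fail the batch.
     */
    static List<Map.Entry<String, byte[]>> map(List<String> urls,
                                               Function<String, String> fetcher) {
        List<Map.Entry<String, byte[]>> out = new ArrayList<>();
        for (String url : urls) {
            try {
                String html = fetcher.apply(url);
                out.add(new SimpleEntry<>(url, html.getBytes(StandardCharsets.UTF_8)));
            } catch (RuntimeException e) {
                // Unreachable or unreadable page: emit nothing for this URL.
            }
        }
        return out;
    }
}
```

Passing the fetcher in as a function keeps the sketch testable offline; a canned fetcher can answer the question "does the mapper return the input URLs and webpages?" without touching the network.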

8 Reducer
Input: URL/Webpage
Extracts all URLs within webpage
Output: NULL/Put (NULL/New URLs)
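The link-extraction step of the reducer might look roughly like the following. Note the swapped-in technique: the slides name Jsoup for fetching pages, and a real job would likely also use it for parsing (for example `select("a[href]")` with `absUrl("href")`), whereas this sketch uses a simple regular expression and `java.net.URI` so that it runs with the standard library alone. `LinkExtractorSketch` and `extractLinks` are hypothetical names, and the regex is a rough approximation that will also pick up `href` attributes on non-anchor tags.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex-based stand-in for the reducer's "extract all URLs" step. */
final class LinkExtractorSketch {

    // Matches href="..." or href='...' and captures the attribute value.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /**
     * Returns the distinct http/https links found in `html`, resolved against
     * `baseUrl`, in order of first appearance.
     */
    static Set<String> extractLinks(String baseUrl, String html) {
        Set<String> links = new LinkedHashSet<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1).trim());
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Attribute value is not a valid URI; skip it.
            }
        }
        return links;
    }
}
```

In the job described here, each returned link would presumably be wrapped in an HBase `Put` and written to the output table with a null key.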

9 Output HBase Table
New URLs
Much larger: ~20*50
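The "~20*50" estimate can be turned into a small back-of-the-envelope calculation. The sketch assumes, purely for illustration, that every page yields about 20 links that were not seen before and that each round therefore multiplies the table size by 20; real crawls hit many duplicates, so this is an optimistic bound. `CrawlGrowth`, `urlsAfter`, and `roundsToReach` are hypothetical names.

```java
/** Rough growth estimate under a constant fan-out and no duplicate links. */
final class CrawlGrowth {

    /** Table size after `rounds` iterations, multiplying by `fanOut` each round. */
    static long urlsAfter(long seeds, long fanOut, int rounds) {
        long total = seeds;
        for (int i = 0; i < rounds; i++) {
            total = Math.multiplyExact(total, fanOut); // throws ArithmeticException on overflow
        }
        return total;
    }

    /** Smallest number of rounds until the table holds at least `target` URLs. */
    static int roundsToReach(long seeds, long fanOut, long target) {
        if (seeds < 1 || fanOut < 2) {
            throw new IllegalArgumentException("need seeds >= 1 and fanOut >= 2");
        }
        int rounds = 0;
        long total = seeds;
        while (total < target) {
            total = Math.multiplyExact(total, fanOut);
            rounds++;
        }
        return rounds;
    }
}
```

Under these assumptions, `urlsAfter(50, 20, 1)` gives 1,000 URLs, and `roundsToReach(50, 20, 2_000_000_000L)` gives 6, since 50 × 20^5 is 160 million and 50 × 20^6 is 3.2 billion.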

10 Methodology
Testing
Case 1: Does the mapper return the input URLs and webpages?
Case 2: Does the reducer return the parsed webpage URLs?
i.e., using a VM
Our strategy: pair programming

11 Conclusion
Experience: HBase, TableMapper, TableReducer, libjars
Distributed computing environment: efficiency, speed, reliability

