Crawdaddy
Created By: Dan Robert and Ronald Richardson II
Motivation
- Gain experience writing Hadoop MapReduce programs
- Work in a distributed computing environment: efficiency, speed, reliability
- Gain hands-on experience with Hadoop, HBase, and Jsoup
Project Idea
- Crawdaddy! A distributed web crawler
- Web crawlers follow links: URLs -> more URLs -> more URLs!
- The same discovery technique search engines rely on
Dataset
- Arbitrary data set: any list of URLs works as a starting point
- Initial HBase table: small, ~50 URLs
- Next iteration: a larger HBase table built from the links just found
- Repeat over and over (sketched below); the web holds roughly 2 billion sites to discover
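A minimal sketch of the repeat-over-and-over idea: each round's output table becomes the next round's input. runCrawlJob is a hypothetical helper that would wrap the Driver sketched under Components; the table names are ours, not from the original project.

```java
public class CrawlLoop {
    public static void main(String[] args) throws Exception {
        String input = "crawl_0";              // seed table, ~50 URLs
        for (int round = 1; round <= 3; round++) {
            String output = "crawl_" + round;  // grows each iteration
            runCrawlJob(input, output);        // hypothetical: submits one MR crawl job
            input = output;                    // next round reads what this round wrote
        }
    }

    static void runCrawlJob(String in, String out) throws Exception {
        // See the Driver sketch under Components; the table names would be
        // parameterized there instead of hard-coded.
    }
}
```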
Components
- Input HBase table
- Mapper
- Reducer
- Driver (sketched below)
- Output HBase table
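The Driver wires these components together. A minimal sketch of what it could look like, assuming table names crawl_in and crawl_out and the CrawlMapper/CrawlReducer classes sketched in the following slides; all of these names are illustrative, not the original project code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CrawlDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "crawdaddy");
        job.setJarByClass(CrawlDriver.class);

        Scan scan = new Scan();
        scan.setCaching(100);        // read rows from the input table in batches
        scan.setCacheBlocks(false);  // don't pollute the block cache during MR scans

        // Mapper reads URLs from the input table and emits URL -> page bytes.
        TableMapReduceUtil.initTableMapperJob(
                "crawl_in", scan, CrawlMapper.class,
                Text.class, BytesWritable.class, job);

        // Reducer parses pages and writes the new URLs into the output table.
        TableMapReduceUtil.initTableReducerJob(
                "crawl_out", CrawlReducer.class, job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```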
Input HBase Table
- Holds the initial data: small, ~50 seed URLs (loading sketched below)
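A minimal sketch of how the ~50 seed URLs could be loaded, assuming a pre-created table crawl_in with a column family url; the table layout and row-key choice are our assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SeedLoader {
    public static void main(String[] args) throws Exception {
        String[] seeds = { "http://example.com", "http://example.org" /* ... ~50 URLs */ };
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("crawl_in"))) {
            List<Put> puts = new ArrayList<>();
            for (String url : seeds) {
                // Use the URL itself as the row key; also store it in url:link.
                Put put = new Put(Bytes.toBytes(url));
                put.addColumn(Bytes.toBytes("url"), Bytes.toBytes("link"), Bytes.toBytes(url));
                puts.add(put);
            }
            table.put(puts);
        }
    }
}
```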
Mapper
- Input: URLs from the input HBase table
- Webpages are retrieved using Jsoup
- Output: Text/BytesWritable (URL -> webpage bytes); sketch below
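A minimal sketch of the mapper just described, assuming the row-key-is-URL layout from the seed-loading snippet; the 10-second timeout and skip-on-failure handling are our additions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.jsoup.Jsoup;

public class CrawlMapper extends TableMapper<Text, BytesWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        String url = Bytes.toString(row.get());
        String html;
        try {
            // Fetch the page with Jsoup.
            html = Jsoup.connect(url).timeout(10_000).get().html();
        } catch (IOException e) {
            return; // skip unreachable or malformed URLs instead of failing the task
        }
        // Emit URL -> raw HTML bytes for the reducer to parse.
        context.write(new Text(url), new BytesWritable(html.getBytes(StandardCharsets.UTF_8)));
    }
}
```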
Reducer
- Input: URL/webpage pairs from the mapper
- Extracts all URLs within each webpage
- Output: NULL/Put (one Put per new URL); sketch below
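A minimal sketch of the reducer just described; the column family and qualifier match the earlier snippets and are assumptions rather than the original code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CrawlReducer extends TableReducer<Text, BytesWritable, NullWritable> {
    @Override
    protected void reduce(Text url, Iterable<BytesWritable> pages, Context context)
            throws IOException, InterruptedException {
        for (BytesWritable page : pages) {
            String html = new String(page.copyBytes(), StandardCharsets.UTF_8);
            // Parse the stored HTML; passing the page URL as base URI lets
            // Jsoup resolve relative links to absolute ones.
            Document doc = Jsoup.parse(html, url.toString());
            for (Element link : doc.select("a[href]")) {
                String newUrl = link.attr("abs:href");
                if (newUrl.isEmpty()) continue;
                // One Put per discovered URL; row key = URL, as in the input table.
                Put put = new Put(Bytes.toBytes(newUrl));
                put.addColumn(Bytes.toBytes("url"), Bytes.toBytes("link"), Bytes.toBytes(newUrl));
                context.write(NullWritable.get(), put);
            }
        }
    }
}
```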
Output HBase Table
- Holds the newly discovered URLs
- Much larger: at roughly 20 links per page, ~20 x 50 = ~1,000 URLs after one pass
Methodology
Testing (on a local VM):
- Case 1: Does the mapper return the input URLs and their webpages?
- Case 2: Does the reducer return the URLs parsed from each webpage? (sketch below)
Our strategy: pair programming
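One way Case 2's link extraction could be checked in isolation: a plain JUnit 4 sketch of the Jsoup parse-and-resolve step with a made-up HTML fixture, rather than a full MapReduce run; this is our illustration, not the project's actual test code.

```java
import static org.junit.Assert.assertEquals;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.Test;

public class LinkExtractionTest {
    @Test
    public void extractsAbsoluteUrlsFromPage() {
        String html = "<html><body>"
                + "<a href=\"http://example.com/a\">a</a>"
                + "<a href=\"/relative\">b</a>"
                + "</body></html>";
        // Same parse-and-resolve step the reducer performs.
        Document doc = Jsoup.parse(html, "http://example.com/");
        assertEquals(2, doc.select("a[href]").size());
        assertEquals("http://example.com/relative",
                doc.select("a[href]").get(1).attr("abs:href"));
    }
}
```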
Conclusion
- Gained experience with HBase TableMapper, TableReducer, and libjars
- Worked in a distributed computing environment: efficiency, speed, reliability