Published by Kelley Chambers. Modified over 6 years ago.
Crawdaddy
Created by: Dan Robert and Ronald Richardson II
Motivation
- Gain experience writing a Hadoop MapReduce program
- Work in a distributed computing environment: efficiency, speed, reliability
- Gain experience with Hadoop, HBase, and Jsoup
Project Idea
- Crawdaddy! A distributed web crawler.
- Web crawlers: URLs -> more URLs -> more URLs!
- The backbone of search engines.
Dataset
- Arbitrary data set: an initial HBase table of ~50 seed URLs
- Each iteration produces a larger HBase table
- Repeated over and over, the crawl could reach ~2 billion websites
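A back-of-the-envelope sketch of why a few iterations suffice: assuming each crawled page yields roughly 20 new URLs (the deck's own ~20x estimate), the frontier grows as 50 * 20^k. The numbers here are illustrative, not measured.

```java
public class FrontierGrowth {
    // Frontier size after k iterations, starting from 50 seed URLs and
    // assuming ~20 new URLs discovered per page (a rough, hypothetical rate).
    static long frontierSize(int iterations) {
        long size = 50;
        for (int i = 0; i < iterations; i++) {
            size *= 20;
        }
        return size;
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 6; k++) {
            System.out.println("iteration " + k + ": " + frontierSize(k) + " URLs");
        }
        // 50 * 20^6 = 3,200,000,000, already past the ~2 billion target.
    }
}
```

Under this (optimistic, duplicate-free) model, six map-reduce passes would cover the ~2 billion sites mentioned above.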
Components
- Input HBase table
- Mapper
- Reducer
- Driver
- Output HBase table
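The components above can be wired together as a single Hadoop job. This is a hedged sketch, not the authors' actual code: the table names `input_urls`/`output_urls` and the column family `meta` are invented for illustration, and a real crawler would also need deduplication and politeness (rate-limit) handling. It requires a running HBase cluster plus the HBase and Jsoup jars (shipped via `-libjars`), so it is configuration wiring rather than a standalone program.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Crawdaddy {

    // Mapper: each row key in the input table is a URL; fetch the page
    // with Jsoup and emit (URL, page bytes).
    static class CrawlMapper extends TableMapper<Text, BytesWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            String url = Bytes.toString(row.get());
            try {
                Document doc = Jsoup.connect(url).get();
                byte[] page = doc.outerHtml().getBytes("UTF-8");
                context.write(new Text(url), new BytesWritable(page));
            } catch (IOException fetchFailed) {
                // Skip unreachable pages rather than failing the whole job.
            }
        }
    }

    // Reducer: parse each fetched page, extract its links, and write each
    // new URL as a Put into the output table.
    static class CrawlReducer extends TableReducer<Text, BytesWritable, NullWritable> {
        @Override
        protected void reduce(Text url, Iterable<BytesWritable> pages, Context context)
                throws IOException, InterruptedException {
            for (BytesWritable page : pages) {
                Document doc = Jsoup.parse(new String(page.copyBytes(), "UTF-8"));
                for (Element link : doc.select("a[href]")) {
                    String target = link.absUrl("href");
                    if (target.isEmpty()) continue;
                    Put put = new Put(Bytes.toBytes(target));
                    put.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("seen"),
                                  Bytes.toBytes(1L));
                    context.write(NullWritable.get(), put);
                }
            }
        }
    }

    // Driver: scan the input table through the mapper, write Puts from the reducer.
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "crawdaddy");
        job.setJarByClass(Crawdaddy.class);
        TableMapReduceUtil.initTableMapperJob(
                "input_urls", new Scan(), CrawlMapper.class,
                Text.class, BytesWritable.class, job);
        TableMapReduceUtil.initTableReducerJob("output_urls", CrawlReducer.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```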
Input HBase Table
- Initial data: small, ~50 URLs
Mapper
- Input: URLs in the HBase table
- Webpages are retrieved using Jsoup
- Output: Text/BytesWritable (URL/webpage) pairs
Reducer
- Input: URL/webpage pairs
- Extracts all URLs within each webpage
- Output: NULL/Put (NULL/new URLs)
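The reducer's core step, link extraction, is what Jsoup's `doc.select("a[href]")` does in the project. A stdlib-only stand-in using a regex shows the idea without the Jsoup dependency; `LinkExtractor` and `extractUrls` are hypothetical names, and a regex is a rough approximation of a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Crude stand-in for Jsoup's doc.select("a[href]"): grab the href
    // value of every anchor tag. A real parser handles far more cases.
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<a href='http://example.com/b'>B</a>"
                + "</body></html>";
        System.out.println(extractUrls(page));
        // -> [http://example.com/a, http://example.com/b]
    }
}
```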
Output HBase Table
- New URLs
- Much larger: roughly 20 links per page x 50 pages, ~1,000 URLs
Methodology
- Testing (on a VM)
  - Case 1: Does the mapper return the input URLs and webpages?
  - Case 2: Does the reducer return the URLs parsed from each webpage?
- Our strategy: pair programming
Conclusion
- Gained experience with HBase TableMapper, TableReducer, and the -libjars option
- Worked in a distributed computing environment: efficiency, speed, reliability