Published by Toby Warren. Modified over 9 years ago.
1
Anushree Venkatesh, Sagar Mehta, Sushma Rao
2
- Motivation
- What is Map-Reduce?
- Why Map-Reduce?
- The Hadoop Framework
- Map-Reduce in SILOs
- SILOs Architecture
- Modules
- Experiments
3
- Life span of a web page: 44 to 75 days
- Limitations of centralized/distributed crawling
- Exploring Map-Reduce
- Analysis of web [subset]
  - Web graph
  - Search response quality
  - Tweaked PageRank
  - Inverted index
4
Courtesy: http://www.aharef.info/2006/05/websites_as_graphs.htm
6
- Divide and conquer
- Functional programming counterparts -> distributed data processing
- Plumbing behind the scenes -> focus on the problem
- Map: division of the key space
- Reduce: combine results
- Pipelining functionality
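The map, group, and reduce steps listed above can be illustrated with a small in-memory sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count job are assumptions chosen for illustration, and everything runs in a single process.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group intermediate values by key (the plumbing a framework hides)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine the values of each key into one result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass over in-memory records."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    # Hypothetical example job: emit (word, 1) for every word in a line.
    return [(word.lower(), 1) for word in line.split()]


def word_count_reducer(word, ones):
    return sum(ones)


counts = run_job(["map reduce", "Map the key space"],
                 word_count_mapper, word_count_reducer)
print(counts)
```

Pipelining then amounts to feeding the items of one job's result into `run_job` as the records of the next job.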
7
- Open-source implementation of Map-Reduce in Java
- HDFS (Hadoop Distributed File System): Hadoop-specific file system
- Takes care of fault tolerance and dependencies between nodes
- Setup through a VM instance: problems
8
- Currently a single-node cluster
- HDFS setup
- Incorporation of Berkeley DB
11
[Architecture diagram] Modules, with the pairs handled by each map (M) and reduce (R) step:
- Distributed Crawler: M (URL, value); R (URL, page content)
- URL Extractor: M parse for URL; R (URL, 1), remove duplicates
- Key Word Extractor: M parse for keyword; R (KeyWord, URL)
- Back Links Mapper: M (Parent, URL); R (URL, Parent)
- Graph Builder
Other components: Seed List, Compression
Tables: Page Content Table, Inverted Index Table, Back Links Table, Adjacency List Table, Diff URL Table
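As one possible reading of the Back Links Mapper in the diagram, the sketch below inverts parent-to-URL links into URL-to-parents entries. The function names, the in-memory grouping, and the example URLs are assumptions for illustration; the slides do not show the module's actual code.

```python
from collections import defaultdict


def backlinks_map(parent, outgoing_urls):
    """Emit one (url, parent) pair per outgoing link of a parent page."""
    for url in outgoing_urls:
        yield url, parent


def backlinks_reduce(url, parents):
    """Collapse all parents seen for a URL into a sorted, duplicate-free list."""
    return url, sorted(set(parents))


def build_backlinks(adjacency):
    """Invert an adjacency list {parent: [urls]} into {url: [parents]}."""
    grouped = defaultdict(list)  # stands in for the framework's shuffle step
    for parent, urls in adjacency.items():
        for url, link_parent in backlinks_map(parent, urls):
            grouped[url].append(link_parent)
    return dict(backlinks_reduce(url, parents) for url, parents in grouped.items())


# Hypothetical adjacency list: page "a" links to "b" and "c", page "b" links to "c".
adjacency = {"a": ["b", "c"], "b": ["c"]}
backlinks = build_backlinks(adjacency)
print(backlinks)
```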
13
Map
  Input
  if (!duplicate(url)) {
      insert into url_table
      page_content = http_get(url)
      output intermediate pair
  } else if (duplicate(url) && (current_time - time_stamp(url) > threshold)) {
      page_content = http_get(url)
      update url_table(hash(url), current_time)
      output intermediate pair
  } else {
      update url_table(hash(url), current_time)
  }
Reduce
  Input
  if (!exists hash(url) in page_content_table) {
      insert into page_content_table
  } else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
      insert into page_content_table
  }
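The crawler pseudocode above can be approximated in Python as follows. Several details are assumptions: the tables are plain dictionaries rather than Berkeley DB tables, SHA-1 stands in for `hash`, the intermediate pair is taken to be `(url, page_content)`, and the `http_get` function is passed in so the sketch runs without network access.

```python
import hashlib


def digest(text):
    """Stable hash used as table key and as content fingerprint (assumed SHA-1)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def crawl_map(url, url_table, now, threshold, http_get):
    """Fetch a URL if it is new or stale; return (url, content) or None."""
    key = digest(url)
    if key not in url_table:
        # New URL: record it and fetch the page.
        url_table[key] = now
        return url, http_get(url)
    if now - url_table[key] > threshold:
        # Known URL whose timestamp is older than the threshold: fetch again.
        content = http_get(url)
        url_table[key] = now
        return url, content
    # Known and fresh: only update the timestamp, as in the pseudocode.
    url_table[key] = now
    return None


def crawl_reduce(url, page_content, page_content_table):
    """Store the page if it is unseen or its content changed; report whether stored."""
    key = digest(url)
    stored = page_content_table.get(key)
    if stored is None or digest(stored) != digest(page_content):
        page_content_table[key] = page_content
        return True
    return False
```

A caller would run `crawl_map` over the seed list and pass every non-`None` result to `crawl_reduce`.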
14
- Currently outside of Map-Reduce
- Manual transfer of files to HDFS
- Currently depth-first search; will be modified for breadth-first search
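The planned move from depth-first to breadth-first traversal largely comes down to how the frontier of pending URLs is consumed. The sketch below is an assumption-laden illustration: `crawl_order`, the `links` dictionary, and the page names are hypothetical, and a real crawler would fetch pages instead of reading a prebuilt link map.

```python
from collections import deque


def crawl_order(seed, links, breadth_first=True):
    """Return the order in which pages are visited starting from seed."""
    frontier = deque([seed])
    visited = set()
    order = []
    while frontier:
        # Queue behaviour (popleft) gives BFS; stack behaviour (pop) gives DFS.
        url = frontier.popleft() if breadth_first else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for next_url in links.get(url, []):
            if next_url not in visited:
                frontier.append(next_url)
    return order


# Hypothetical link structure: seed links to a and b, a links to c.
links = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
bfs = crawl_order("seed", links, breadth_first=True)
dfs = crawl_order("seed", links, breadth_first=False)
print(bfs, dfs)
```

Breadth-first order visits all pages at one link distance from the seed before going deeper, which keeps the crawl from descending far into a single site first.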
15
Map
  Input
  list = parse(page_content)
  for each keyword in list, output an intermediate pair
Reduce
  Combine all pairs with the same keyword
  Insert the result into the inverted index table
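A minimal in-memory sketch of the keyword extractor described above is shown here. The tokenizer, the `(keyword, url)` pair shape (suggested by the architecture diagram's "KeyWord, URL"), the function names, and the example pages are assumptions; the deck does not show its `parse` function or table layout.

```python
import re
from collections import defaultdict


def keyword_map(url, page_content):
    """Emit one (keyword, url) pair per distinct keyword on the page."""
    # Crude tokenizer (assumption): lowercase runs of letters and digits.
    for keyword in set(re.findall(r"[a-z0-9]+", page_content.lower())):
        yield keyword, url


def keyword_reduce(keyword, urls):
    """Combine all URLs that share a keyword into one sorted posting list."""
    return keyword, sorted(set(urls))


def build_inverted_index(pages):
    """Build {keyword: [urls]} from {url: page_content}."""
    grouped = defaultdict(list)  # stands in for the framework's shuffle step
    for url, content in pages.items():
        for keyword, page_url in keyword_map(url, content):
            grouped[keyword].append(page_url)
    return dict(keyword_reduce(k, urls) for k, urls in grouped.items())


# Hypothetical page contents keyed by hypothetical URLs.
pages = {"u1": "Alumni News", "u2": "news library"}
index = build_inverted_index(pages)
print(index["news"])
```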
16
Top Words Along with Their Frequency

CMU                Cornell            Gatech
Carnegie    2456   Cornell      742   Tech        2704
Mellon      2107   University   378   Georgia     1882
University  1157   College      158   Alumni      1115
Alumni       786   Admissions   128   Services     885
Center       466   Research      99   Association  646
News         395   Student       94   Career       493
Library      393   School        89   Baseball     416
PA           373   Information   77   Engineering  408
Research     357   York          74   Tennis       222
Pittsburgh,  352   Alumni        71   Information  219
Information  313   Academics     62   students     198
School       309   Ithaca        59   Institute    173
                                      Atlanta      164
19
Top 6 URL Domains That Get Traversed

CMU
  alumni.cmu.edu                92
  hr.web.cmu.edu                13
  www.alumniconnections.com     16
  www.carnegiemellontoday.com   10
  www.cmu.edu                  170
  www.library.cmu.edu           69
Cornell
  www.cornell.edu               43
  www.cuinfo.cornell.edu         2
  www.gradschool.cornell.edu     2
  www.news.cornell.edu           7
  www.sce.cornell.edu            8
  www.vet.cornell.edu            1
Gatech
  centennial.gtalumni.org        4
  cyberbuzz.gatech.edu           7
  georgiatech.searchease.com     9
  gtalumni.org                 236
  ramblinwreck.cstv.com         56
  www.gatech.edu                14
20
Avg URL Depth

CMU
  cmu.edu                      2.73
  alumni.cmu.edu               2.18
  www.library.cmu.edu          2.23
  www.alumniconnections.com    4.81
Cornell
  cornell.edu                  1.34
  www.gradschool.cornell.edu   1
  www.news.cornell.edu         2.57
  www.sce.cornell.edu          1
Gatech
  gatech.edu                   1
  gtalumni.org                 3
  ramblinwreck.cstv.com        2.57
  cyberbuzz.gatech.edu         2
21
Questions, Comments, Criticisms
22
- HTML Parser
- Hadoop Framework (Apache)
- Peer Crawl