Anushree Venkatesh, Sagar Mehta, Sushma Rao. Motivation; What is Map-Reduce?; Why Map-Reduce?; The HADOOP Framework; Map-Reduce in SILOs; SILOs Architecture.


1 Anushree Venkatesh, Sagar Mehta, Sushma Rao

2
- Motivation
- What is Map-Reduce?
- Why Map-Reduce?
- The HADOOP Framework
- Map-Reduce in SILOs
- SILOs Architecture
- Modules
- Experiments

3
- Life span of a web page: 44 to 75 days
- Limitations of centralized/distributed crawling
- Exploring Map-Reduce
- Analysis of web [subset]
- Web graph
- Search response quality
  - Tweaked page rank
  - Inverted index

4 Courtesy: http://www.aharef.info/2006/05/websites_as_graphs.htm


6
- Divide and conquer
- Functional programming counterparts -> distributed data processing
- Plumbing behind the scenes -> focus on the problem
- Map: division of key space
- Reduce: combine results
- Pipelining functionality
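
The map, group-by-key, and reduce steps listed on this slide can be sketched in a single process. This is only an illustration, not Hadoop code: `map_fn`, `reduce_fn`, and `run_map_reduce` are hypothetical names, and word counting is chosen simply because it is the smallest useful example.

```python
from collections import defaultdict


def map_fn(_key, text):
    """Map: split one input record into (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values that share the same key."""
    return word, sum(counts)


def run_map_reduce(records, map_fn, reduce_fn):
    """Run map, group the intermediate pairs by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # The grouping above stands in for the framework's "plumbing".
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


if __name__ == "__main__":
    docs = [("d1", "map reduce map"), ("d2", "reduce")]
    print(run_map_reduce(docs, map_fn, reduce_fn))
```

In a real cluster the grouping step is done by the framework across machines; here it is a plain dictionary so the data flow stays visible.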

7
- Open source implementation of Map-Reduce in Java
- HDFS: Hadoop-specific file system
- Takes care of
  - fault tolerance
  - dependencies between nodes
- Setup through VM instance: problems

8
- Currently single-node cluster
- HDFS setup
- Incorporation of Berkeley DB


11 Modules (M: map step, R: reduce step) and tables
- Seed List
- URL Extractor
  - M: Parse for URL
  - R: URL, 1 (remove duplicates)
- Distributed Crawler
  - M: URL, value
  - R: URL, page content
- Key Word Extractor
  - M: Parse for key word
  - R: KeyWord, URL
- Back Links Mapper
  - M: Parent, URL
  - R: URL, Parent
- Graph Builder
- Compression
- Tables: Page Content Table, Inverted Index Table, Back Links Table, Adjacency List Table, Diff URL Table
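
As one reading of the Back Links Mapper on this slide (map over Parent, URL pairs; reduce to URL, Parent), the sketch below inverts an adjacency list into a back-links table. The function names and the in-memory dictionaries are assumptions made for illustration; the slides do not show how SILOs stores these tables.

```python
from collections import defaultdict


def backlinks_map(parent, child_urls):
    """Map: for every link found on `parent`, emit (child URL, parent)."""
    for url in child_urls:
        yield url, parent


def backlinks_reduce(url, parents):
    """Reduce: collect the distinct pages that link to `url`."""
    return url, sorted(set(parents))


def build_backlinks(adjacency):
    """Invert an adjacency list {parent: [children]} into {url: [parents]}."""
    groups = defaultdict(list)
    for parent, children in adjacency.items():
        for url, linking_page in backlinks_map(parent, children):
            groups[url].append(linking_page)
    return dict(backlinks_reduce(u, ps) for u, ps in groups.items())


if __name__ == "__main__":
    adjacency = {"a": ["b", "c"], "b": ["c"]}
    print(build_backlinks(adjacency))
```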


13
Map
  Input
  if (!duplicate(url)) {
      insert into url_table
      page_content = http_get(url)
      output intermediate pair
  } else if (duplicate(url) && (current_time - time_stamp(url) > threshold)) {
      page_content = http_get(url)
      update url_table(hash(url), current_time)
      output intermediate pair
  } else {
      update url_table(hash(url), current_time)
  }
Reduce
  Input
  if (!exists hash(url) in page_content_table) {
      insert into page_content_table
  } else if (hash(page_content_table(hash(url))) != hash(current_page_content)) {
      insert into page_content_table
  }
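
The crawler pseudocode above can be sketched in runnable form as follows. Several things here are assumptions rather than details from the slides: the tables are plain dictionaries instead of Berkeley DB, SHA-1 stands in for the unspecified `hash`, `THRESHOLD_SECONDS` is an arbitrary value, and the page fetcher is passed in as a `fetch` callable so the logic can be exercised without network access.

```python
import hashlib
import time

THRESHOLD_SECONDS = 3600  # assumed re-crawl threshold; the slide gives no value

url_table = {}           # hash(url) -> time stamp of the last visit
page_content_table = {}  # hash(url) -> stored page content


def content_hash(text):
    """Stand-in for the slide's hash(); SHA-1 is an assumption."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def crawl_map(url, fetch, now=None):
    """Return (url, page_content) if the page is new or stale, else None."""
    now = time.time() if now is None else now
    key = content_hash(url)
    last_seen = url_table.get(key)
    is_new = last_seen is None
    is_stale = not is_new and (now - last_seen > THRESHOLD_SECONDS)
    # As on the slide, the time stamp is written in every branch.
    url_table[key] = now
    if is_new or is_stale:
        return url, fetch(url)
    return None


def crawl_reduce(url, page_content):
    """Store the page if it is unseen or its content changed; report whether stored."""
    key = content_hash(url)
    previous = page_content_table.get(key)
    if previous is None or content_hash(previous) != content_hash(page_content):
        page_content_table[key] = page_content
        return True
    return False


if __name__ == "__main__":
    pair = crawl_map("http://example.com/", lambda u: "<html>hello</html>")
    if pair is not None:
        print(crawl_reduce(*pair))
```

Passing `fetch` as a parameter is a design choice for testability only; a real map task would presumably call an HTTP client directly.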

14
- Currently outside of Map-Reduce
- Manual transfer of files to HDFS
- Currently depth-first search; will be modified for breadth-first search
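
To illustrate the difference between the two traversal orders mentioned on this slide, here is a small sketch in which the only change between a stack-based depth-first order and a breadth-first order is which end of the frontier is popped. The `crawl_order` function and the toy link graph are hypothetical and are not taken from the SILOs code.

```python
from collections import deque


def crawl_order(seed, links, breadth_first=True):
    """Return the order in which pages reachable from `seed` are visited."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # Queue (oldest first) gives breadth-first; stack (newest first) gives depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order


if __name__ == "__main__":
    links = {"a": ["b", "c"], "b": ["d"]}
    print(crawl_order("a", links, breadth_first=True))
    print(crawl_order("a", links, breadth_first=False))
```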

15
Map
  Input
  list = parse(page_content)
  for each keyword, output an intermediate pair
Reduce
  combine all pairs with the same keyword into one entry
  insert into inverted index table
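
A minimal sketch of this keyword extraction step, assuming the intermediate pair is (keyword, URL) as the module diagram suggests. The tokenizer (lowercased alphanumeric runs), the function names, and the dictionary used as the inverted index table are all assumptions; the slides name an HTML parser but do not show how keywords are split, and this sketch does not track frequencies.

```python
import re
from collections import defaultdict


def keyword_map(url, page_content):
    """Map: emit one (keyword, url) pair per distinct word on the page."""
    for word in set(re.findall(r"[a-z0-9]+", page_content.lower())):
        yield word, url


def keyword_reduce(keyword, urls):
    """Reduce: combine all pairs with the same keyword into one entry."""
    return keyword, sorted(set(urls))


def build_inverted_index(pages):
    """Build {keyword: [urls]} from {url: page_content}."""
    groups = defaultdict(list)
    for url, content in pages.items():
        for keyword, source_url in keyword_map(url, content):
            groups[keyword].append(source_url)
    return dict(keyword_reduce(k, us) for k, us in groups.items())


if __name__ == "__main__":
    pages = {"u1": "Carnegie Mellon", "u2": "Mellon library"}
    print(build_inverted_index(pages)["mellon"])
```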

16 Top words along with their frequency

CMU                  Cornell              Gatech
Carnegie      2456   Cornell        742   Tech          2704
Mellon        2107   University     378   Georgia       1882
University    1157   College        158   Alumni        1115
Alumni         786   Admissions     128   Services       885
Center         466   Research        99   Association    646
News           395   Student         94   Career         493
Library        393   School          89   Baseball       416
PA             373   Information     77   Engineering    408
Research       357   York            74   Tennis         222
Pittsburgh,    352   Alumni          71   Information    219
Information    313   Academics       62   students       198
School         309   Ithaca          59   Institute      173
                                          Atlanta        164


19 Top 6 URL domains that get traversed

CMU
  alumni.cmu.edu                 92
  hr.web.cmu.edu                 13
  www.alumniconnections.com      16
  www.carnegiemellontoday.com    10
  www.cmu.edu                   170
  www.library.cmu.edu            69

Cornell
  www.cornell.edu                43
  www.cuinfo.cornell.edu          2
  www.gradschool.cornell.edu      2
  www.news.cornell.edu            7
  www.sce.cornell.edu             8
  www.vet.cornell.edu             1

Gatech
  centennial.gtalumni.org         4
  cyberbuzz.gatech.edu            7
  georgiatech.searchease.com      9
  gtalumni.org                  236
  ramblinwreck.cstv.com          56
  www.gatech.edu                 14

20 Avg URL depth

CMU
  cmu.edu                      2.73
  alumni.cmu.edu               2.18
  www.library.cmu.edu          2.23
  www.alumniconnections.com    4.81

Cornell
  cornell.edu                  1.34
  www.gradschool.cornell.edu   1
  www.news.cornell.edu         2.57
  www.sce.cornell.edu          1

Gatech
  gatech.edu                   1
  gtalumni.org                 3
  ramblinwreck.cstv.com        2.57
  cyberbuzz.gatech.edu         2

21 Questions, Comments, Criticisms

22
- HTML Parser
- Hadoop Framework (Apache)
- Peer Crawl


