Efficient Crawling Through URL Ordering. Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Stanford InfoLab.

Presentation transcript:

1 Efficient Crawling Through URL Ordering. Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Stanford InfoLab

2 What is a crawler?
- A program that automatically retrieves pages from the Web.
- Widely used by search engines.
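The retrieval loop the slide describes can be sketched as a minimal breadth-first crawler. Everything here is a simplified illustration: `fetch` and `extract_links` are assumed helper callables (fetch a page's content; list the URLs it points to), not part of the original system.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages):
    """Minimal breadth-first crawler sketch.

    fetch(url) -> page content; extract_links(page) -> URLs the page
    points to. Both are hypothetical helpers for this illustration.
    """
    frontier = deque(seed_urls)   # URLs discovered but not yet visited
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        pages[url] = page
        for link in extract_links(page):
            if link not in visited:
                frontier.append(link)
    return pages
```

The FIFO frontier gives breadth-first order; the later slides are about replacing that order with something smarter.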

3 Challenges
- There are many pages on the Web. (Major search engines have indexed more than 100M pages.)
- The size of the Web is growing enormously.
- Most pages are not very interesting.
⇒ In most cases, it is too costly or not worthwhile to visit the entire Web space.

4 Good crawling strategy
- Make the crawler visit “important” pages first.
  - Save network bandwidth
  - Save storage space and management cost
  - Serve quality pages to the client application

5 Outline
- Importance metrics: what are important pages?
- Crawling models: how is a crawler evaluated?
- Experiments
- Conclusion & future work

6 Importance metric
The metric for determining if a page is HOT:
- Similarity to a driving query
- Location metric
- Backlink count
- PageRank

7 Similarity to a driving query
- Importance is measured by the closeness of the page to the topic (e.g. the number of occurrences of the topic word in the page).
- Enables a personalized crawler that gathers the pages related to a specific topic. (Examples: “Sports”, “Bill Clinton”)
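A minimal sketch of the similarity measure the slide mentions, taking "closeness to the topic" to be a simple count of topic-word occurrences (one of several textual similarity measures one could plug in here):

```python
import re

def similarity(page_text, query_words):
    """Importance by similarity to a driving query: the number of
    occurrences of the query words in the page text."""
    tokens = re.findall(r"\w+", page_text.lower())
    return sum(tokens.count(w.lower()) for w in query_words)
```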

8 Importance metric
The metric for determining if a page is HOT:
- Similarity to a driving query
- Location metric
- Backlink count
- PageRank

9 Backlink-based metric
- Backlink count
  - The number of pages pointing to the page
  - A citation metric
- PageRank
  - A weighted backlink count
  - The weights are iteratively defined

10 [Example: a small link graph over pages A–F]
BackLinkCount(F) = 2
PageRank(F) = PageRank(E)/2 + PageRank(C)
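Both backlink-based metrics can be computed from the link graph. The sketch below uses a hypothetical edge set chosen to be consistent with the slide's formulas (C points only to F; E points to F and one other page), and includes the standard damping term that the slide's simplified formula omits.

```python
def backlink_count(graph):
    """graph maps each page to the list of pages it links to."""
    counts = {p: 0 for p in graph}
    for src, targets in graph.items():
        for t in targets:
            counts[t] += 1
    return counts

def pagerank(graph, damping=0.85, iters=50):
    """Iteratively computed weighted backlink count: each page shares
    its current rank equally among the pages it links to."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in graph}
        for src, targets in graph.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank
```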

11 Ordering metric
- The metric a crawler uses to “estimate” the importance of a page before visiting it
- The ordering metric can be different from the importance metric

12 Crawling models
- Crawl and stop
  - Keep crawling until the local disk space is full.
- Limited buffer crawl
  - Keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

13 Crawl and stop model
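One way to read the crawl-and-stop model as an evaluation: after the crawler halts at its first k pages (e.g. when disk space fills), score it by the fraction of the hot pages it managed to collect. A minimal sketch under that reading, with hypothetical argument names:

```python
def crawl_and_stop_performance(crawl_order, hot_pages, k):
    """Fraction of the hot pages collected among the first k pages
    visited, i.e. how well the ordering front-loaded important pages."""
    collected = set(crawl_order[:k])
    return len(collected & set(hot_pages)) / len(hot_pages)
```

A perfect ordering metric would reach 1.0 as soon as k equals the number of hot pages.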

14 Crawling models
- Crawl and stop
  - Keep crawling until the local disk space is full.
- Limited buffer crawl
  - Keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

15 Limited buffer model

16 Architecture
[Diagram: the WebBase crawler fetches pages from the Stanford WWW into a repository. A virtual crawler reads crawled pages from the repository; its HTML parser extracts URLs into a URL pool along with page info, and a URL selector feeds each selected URL back to the virtual crawler.]

17 Experiments
- Backlink-based importance metrics
  - Backlink count
  - PageRank
- Similarity-based importance metric
  - Similarity to a query word

18 Ordering metrics in experiments
- Breadth-first order
- Backlink count
- PageRank
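The backlink-count ordering above can be sketched as a crawler that, at each step, visits the known-but-unvisited URL with the most links discovered so far pointing to it. This is an illustrative simplification (linear scan instead of a priority queue, alphabetical tie-break); `fetch_links` is an assumed helper returning the URLs a page links to.

```python
def crawl_by_backlinks(seeds, fetch_links, max_pages):
    """Crawler sketch using backlink count as the ordering metric."""
    inlinks = {}                  # backlinks seen so far, per URL
    seen = set(seeds)
    visited = []
    while len(visited) < max_pages and len(seen) > len(visited):
        # Highest current backlink count wins; sort for a stable tie-break.
        candidates = sorted(u for u in seen if u not in visited)
        url = max(candidates, key=lambda u: inlinks.get(u, 0))
        visited.append(url)
        for link in fetch_links(url):
            inlinks[link] = inlinks.get(link, 0) + 1
            seen.add(link)
    return visited
```

Swapping the key function changes the ordering metric: a constant key gives breadth-first-like behavior, and a PageRank estimate over the pages seen so far gives the PageRank ordering.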

20 Similarity-based crawling
- The content of a page is not available before it is visited
- Essentially, the crawler should “guess” the content of the page
- More difficult than backlink-based crawling

21 Promising page
[Diagram: clues that an unvisited page is promising, using only information known before fetching it — the anchor text of a link to it (“Sports!!”), its URL (…/sports.html), and whether the parent page pointing to it is HOT.]

22 Virtual crawler for similarity-based crawling
- A page is “promising” if:
  - The query word appears in its anchor text
  - The query word appears in its URL
  - The page pointing to it is an “important” page
- Visit “promising” pages first
- Visit “non-promising” pages in the ordering-metric order
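The three promising-page clues above combine into a simple predicate over information available before the page is fetched. A minimal sketch; the URLs in the usage test are hypothetical examples:

```python
def is_promising(url, anchor_text, parent_is_hot, query):
    """True if any of the slide's clues suggests relevance: the query
    word in the anchor text, the query word in the URL, or an
    important ("hot") parent page pointing to this one."""
    q = query.lower()
    return q in anchor_text.lower() or q in url.lower() or parent_is_hot
```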

24 Conclusion
- PageRank is generally good as an ordering metric.
- By applying a good ordering metric, it is possible to gather important pages quickly.

25 Future work
- Limited buffer crawling model
- Replicated page detection
- Consistency maintenance

26 Problem
- In what order should a crawler visit Web pages to get the pages we want?
- How can we get important pages first?

27 WebBase
- A system for creating and maintaining a large local repository of Web pages
- High indexing speed (50 pages/sec) and a large repository (150 GB)
- A load-balancing scheme to prevent servers from crashing

28 Virtual Web crawler
- The crawler used for the experiments
- Runs on top of the WebBase repository
- No load balancing
- The dataset was restricted to the Stanford domain

29 Available Information
- Anchor text
- URL of the page
- The content of the page pointing to it