How to Crawl the Web. Hector Garcia-Molina, Stanford University. Joint work with Junghoo Cho.

Stanford InterLib Technologies: economic concerns, information loss, information overload, service heterogeneity, physical barriers, interoperability, value filtering, mobile access, IP infrastructure, archival repository.

WebBase Architecture [diagram]: web crawlers fetch pages from the WWW into a repository via a store server; a feature repository, retrieval indexes, and a multicast engine sit behind the WebBase API, which serves clients.

4 What is a Crawler?
Start from a set of initial URLs: init the "urls to visit" queue, then repeatedly get the next URL, get the page, and extract its URLs, adding unseen ones to the queue while recording visited URLs and downloaded web pages.
[Diagram: init -> get next url -> get page -> extract urls, with "urls to visit", "visited urls", and "web pages" stores]
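The loop on this slide can be written down directly. A minimal, single-threaded sketch; `fetch` and `extract_urls` are placeholders standing in for real HTTP and HTML-parsing code:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_urls, max_pages=100):
    """Minimal crawler loop: init with seeds, then repeatedly
    take the next URL, fetch the page, and queue new links."""
    to_visit = deque(seed_urls)          # "urls to visit"
    visited = set()                      # "visited urls"
    pages = {}                           # "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()         # get next url
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                # get page
        pages[url] = page
        for link in extract_urls(page):  # extract urls
            if link not in visited:
                to_visit.append(link)
    return pages
```

With an in-memory "web" mapping URLs to link lists, `crawl(["a"], lambda u: web[u], lambda p: p)` visits every page reachable from the seed.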

5 Crawling at Stanford
- WebBase project
- BackRub search engine, PageRank
- Google
- New WebBase crawler

6 Crawling Issues (1): load at visited web sites
- honor robots.txt files
- space out requests to a site
- limit the number of requests to a site per day
- limit the depth of the crawl
- multicast
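These politeness rules can be sketched as a small helper. An illustrative sketch, not the WebBase implementation: the class name is ours, the 10-second delay and 3,000-requests-per-day cap echo the experimental setup described later in the talk, and `urllib.robotparser` is Python's standard robots.txt parser:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

class PoliteFetcher:
    """Honors robots.txt, spaces out requests to each site,
    and caps the number of requests to a site per day."""

    def __init__(self, delay=10.0, max_per_day=3000):
        self.delay = delay
        self.max_per_day = max_per_day
        self.last_hit = {}   # site -> time of last request
        self.count = {}      # site -> requests so far today
        self.robots = {}     # site -> parsed robots.txt

    def allowed(self, url, agent="*"):
        """Check robots.txt (fetched over the network on first use)."""
        site = urlparse(url).netloc
        if site not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{site}/robots.txt")
            rp.read()
            self.robots[site] = rp
        return self.robots[site].can_fetch(agent, url)

    def wait_turn(self, url):
        """Sleep if needed to space out requests; False once the
        daily per-site cap is reached."""
        site = urlparse(url).netloc
        if self.count.get(site, 0) >= self.max_per_day:
            return False
        gap = time.time() - self.last_hit.get(site, 0.0)
        if gap < self.delay:
            time.sleep(self.delay - gap)
        self.last_hit[site] = time.time()
        self.count[site] = self.count.get(site, 0) + 1
        return True
```

A crawler would call `allowed(url)` and `wait_turn(url)` before each fetch; per-site rate limiting keeps the crawler from overwhelming any single server even when its global throughput is high.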

7 Crawling Issues (2): load at the crawler
- parallelize the crawl
[Diagram: two copies of the crawling loop (init -> get next url -> get page -> extract urls) sharing the to-visit, visited, and page stores, with a question mark over how they coordinate]

8 Crawling Issues (3): scope of the crawl
- not enough space for "all" pages
- not enough time to visit "all" pages
- Solution: visit "important" pages
[Diagram: visited pages as a small region of the web]

9 Crawling Issues (4): incremental crawling
- How do we keep pages "fresh"?
- How do we avoid crawling from scratch?
This is the focus of this talk.

10 Web Evolution Experiment
- How often does a web page change?
- What is the lifespan of a page?
- How long does it take for 50% of the web to change?
- How do we model web changes?

11 Experimental Setup
- February 17 to June 24, sites visited (with permission)
  - identified 400 sites with the highest PageRank
  - contacted administrators
- 720,000 pages collected
  - 3,000 pages from each site daily
  - start at the root, visit breadth-first (get new & old pages)
  - ran only 9pm-6am, 10 seconds between requests to a site

12 How Often Does a Page Change?
Example: a page is visited daily for 50 days and 5 changes are detected, so the average change interval = 50/5 = 10 days. Is this correct? (A daily visit can miss changes: several changes between two visits are detected as one.)
[Diagram: timeline of daily page visits and detected changes]

13 Average Change Interval
[Histogram: fraction of pages vs. average change interval]

14 Average Change Interval — By Domain
[Histogram: fraction of pages vs. average change interval, broken down by domain]

15 How Long Does a Page Live?
[Diagram: four cases relating a page's lifetime to the experiment duration — the lifetime may fall inside the window, start before it, end after it, or extend beyond both ends, so observed lifetimes can be truncated]

16 Page Lifespans
[Histogram: fraction of pages vs. lifespan]

17 Page Lifespans
[Histogram: fraction of pages vs. lifespan, estimated using Method 1]

18 Time for a 50% Change
[Plot: fraction of unchanged pages vs. days]

19 Modeling Web Evolution
Poisson process with rate λ; T is the time to the next change event:
f_T(t) = λ e^(-λt)   (t > 0)
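Under this model the inter-change times are independent draws from an Exponential(λ) distribution, which makes the process easy to simulate. A quick sketch (the helper name is ours):

```python
import random

def sample_change_times(lam, horizon, rng=random.Random(0)):
    """Simulate a Poisson change process with rate `lam` (changes/day):
    inter-change times are i.i.d. Exponential(lam), f_T(t) = lam*e^(-lam*t)."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(lam)
        if t > horizon:
            return times
        times.append(t)

# A page that changes every 10 days on average (lam = 0.1/day):
changes = sample_change_times(lam=0.1, horizon=100_000)
rate = len(changes) / 100_000
```

Over a long horizon the empirical change rate comes back close to the λ used to generate the process, mirroring the model-fit check on the next slide.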

20 Change Interval of Pages
For pages that change every 10 days on average, the observed distribution of change intervals matches the Poisson model well.
[Plot: fraction of changes with a given interval vs. interval in days, with the Poisson model curve overlaid]

21 Change Metrics: Freshness
- Freshness of element e_i at time t:
  F(e_i; t) = 1 if e_i is up-to-date at time t, 0 otherwise
- Freshness of the database S of N elements at time t:
  F(S; t) = (1/N) Σ_{i=1..N} F(e_i; t)
[Diagram: web elements e_i mirrored in the local database]

22 Change Metrics: Age
- Age of element e_i at time t:
  A(e_i; t) = 0 if e_i is up-to-date at time t, t - (modification time of e_i) otherwise
- Age of the database S at time t:
  A(S; t) = (1/N) Σ_{i=1..N} A(e_i; t)

23 Change Metrics: Time Averages
[Plot: F(e_i) and A(e_i) over time — freshness drops to 0 at each update and returns to 1 at each refresh; age grows linearly between an update and the next refresh]
- F̄(e_i) = lim_{t→∞} (1/t) ∫_0^t F(e_i; s) ds
- F̄(S) = lim_{t→∞} (1/t) ∫_0^t F(S; s) ds
- similarly for age
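These time averages can be checked numerically. The sketch below simulates one page that changes at Poisson rate λ and is refreshed every 1/f days; under this model the freshness at time t after a refresh is P(no change) = e^(-λt), so the time average works out to (1 - e^(-λ/f)) / (λ/f). The function name and the discretization step `dt` (which introduces a small error) are ours:

```python
import random

def avg_freshness(lam, refresh_interval, horizon=20_000.0, dt=0.1,
                  rng=random.Random(0)):
    """Time-averaged freshness of one page: it changes at Poisson
    rate `lam` and the crawler refreshes its copy every
    `refresh_interval` days.  F(e; t) = 1 iff no change has
    occurred since the last refresh."""
    next_change = rng.expovariate(lam)
    last_refresh, fresh, total, t = 0.0, True, 0.0, 0.0
    while t < horizon:
        if fresh:
            total += dt
        t += dt
        if t >= next_change:                      # page changed: copy goes stale
            fresh = False
            next_change += rng.expovariate(lam)
        if t - last_refresh >= refresh_interval:  # crawler refreshes the copy
            fresh = True
            last_refresh = t
    return total / horizon
```

For λ = 0.1/day and a 10-day refresh interval (so λ/f = 1), both the simulation and the closed form give roughly 1 - e^(-1) ≈ 0.63.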

24 Crawler Types
- In-place vs. shadow: an in-place crawler updates the current collection directly; a shadow crawler builds a separate shadow database that periodically replaces the current one.
- Steady vs. batch: a steady crawler runs continuously; a batch crawler alternates between crawler-on and crawler-off periods.
[Diagrams: web elements e_i, the local database, a shadow database, and a crawler-on/crawler-off timeline]

25 Comparison: Batch vs. Steady
[Plot: freshness over time for a batch-mode in-place crawler (shaded intervals mark when the crawler is running) vs. a steady in-place crawler]

26 Shadowing a Steady Crawler
[Plot: freshness of the crawler's collection and of the current collection over time, compared with the no-shadowing case]

27 Shadowing a Batch Crawler
[Plot: freshness of the crawler's collection and of the current collection over time, compared with the no-shadowing case]

28 Experimental Data → Freshness
- Pages change on average every 4 months.
- Batch crawler works one week out of …

29 Refresh Order
- Fixed order — example: explicit list of URLs to visit
- Random order — example: start from seed URLs & follow links
- Purely random — example: refresh pages on demand, as requested by users
[Diagram: web elements e_i and the local database]

30 Freshness vs. Order
Steady, in-place crawler; pages change at the same average rate.
r = λ/f = average change frequency / average visit frequency
[Plot: freshness vs. r for the three refresh orders]

31 Age vs. Order
r = λ/f = average change frequency / average visit frequency
[Plot: age, normalized by the time to refresh all N elements, vs. r for the three refresh orders]

32 Non-Uniform Change Rates
[Plot: distribution g(λ) — fraction of elements vs. change rate λ]

33 Trick Question
A two-page database: e_1 changes daily, e_2 changes once a week. We can visit pages once a week. How should we visit pages?
- e_1 e_1 e_1 e_1 e_1 e_1 ... (only e_1)
- e_2 e_2 e_2 e_2 e_2 e_2 ... (only e_2)
- e_1 e_2 e_1 e_2 e_1 e_2 ... [uniform]
- e_1 e_1 e_1 e_1 e_1 e_1 e_1 e_2 e_1 e_1 ... [proportional]
- ?

34 Proportional Is Often Not Good!
- Visiting the fast-changing e_1 buys only about 1/2 day of freshness, since it changes again within a day.
- Visiting the slow-changing e_2 buys about 1/2 week of freshness.
- Visiting e_2 is a better deal!
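The intuition can be made concrete with the closed-form freshness of a periodically refreshed page under the Poisson model, F̄(λ, f) = (1 - e^(-λ/f)) / (λ/f). A sketch; the two-downloads-per-week budget and the 7:1 proportional split are our illustrative reading of the trick question:

```python
import math

def avg_freshness(lam, f):
    """Closed-form time-averaged freshness of a page with Poisson
    change rate `lam`, refreshed periodically at rate `f`
    (both in events per day): F = (1 - e^(-lam/f)) / (lam/f)."""
    r = lam / f
    return (1 - math.exp(-r)) / r

lam1, lam2 = 1.0, 1.0 / 7.0      # e_1 changes daily, e_2 weekly
budget = 2.0 / 7.0               # two downloads per week in total

# Uniform policy: split the visit budget equally between the pages.
uniform = (avg_freshness(lam1, budget / 2) +
           avg_freshness(lam2, budget / 2)) / 2

# Proportional policy: split the budget 7:1, matching the change rates.
proportional = (avg_freshness(lam1, budget * 7 / 8) +
                avg_freshness(lam2, budget / 8)) / 2
```

With these numbers the uniform policy comes out clearly ahead, roughly 0.39 average freshness versus roughly 0.25 for proportional: chasing the fast-changing page buys very little freshness per visit.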

35 Selecting the Optimal Refresh Frequency
- The analysis is complex, but the shape of the curve is the same in all cases.
- Holds for any change-rate distribution g(λ).

36 Optimal Refresh Frequency for Age
- The analysis is also complex; the shape of the curve is the same in all cases.
- Holds for any distribution g(λ).

37 Comparing Policies
In-place, steady crawler; based on our experimental data. [Pages change at different frequencies, as measured in the experiment.]

38 Building an Incremental Crawler
- Steady, in-place, variable visit frequencies; fixed order improves freshness!
- Also need:
  - a page change-frequency estimator
  - a page replacement policy
  - a page ranking policy (what are "important" pages?)
- See the papers...
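The change-frequency estimator deserves a note: the naive estimate (changes detected / visits x interval) underestimates λ, because several changes between two visits are detected as one. Under the Poisson model, the probability that a visit detects at least one change is 1 - e^(-λI), which inverts cleanly. A sketch of that idea (Cho and Garcia-Molina's papers develop refined, bias-corrected estimators):

```python
import math

def estimate_change_rate(visits, changes_detected, interval_days):
    """Estimate a page's Poisson change rate from crawl history:
    with n visits spaced `interval_days` apart and x visits that
    detected a change, x/n estimates 1 - e^(-lam*I), so
    lam = -ln(1 - x/n) / I.  The naive x/(n*I) is biased low."""
    n, x = visits, changes_detected
    if x >= n:
        raise ValueError("changed on every visit: rate not identifiable")
    return -math.log(1 - x / n) / interval_days
```

For the earlier example of 5 detected changes in 50 daily visits, this gives λ ≈ 0.105 per day, slightly above the naive 5/50 = 0.1, since some intervals likely hid more than one change.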

39 The End
Thank you for your attention. For more information, follow the "Searchable Publications" link and search for author = "Cho".