Web Search – Summer Term 2006
IV. Web Search - Crawling (part 2)
(c) Wolfgang Hürst, Albert-Ludwigs-University
Crawling - Recap from last time
General procedure: Continuously process a list of URLs and collect the corresponding web pages and the links found along the way.
Two problems: size and frequent changes.
Page selection: based on metrics, e.g.
- Importance metric (goal)
- Ordering metric (selection)
- Quality metric (evaluation)
Experimental verification with a representative test collection.
Page refresh: Estimating the rate of change: see last lecture (note: other studies exist, e.g. [5]).
Observations:
- Frequent changes
- Significant differences, e.g. among domains
Hence: an update rule is necessary.
3. Page Refresh (Update Rules)
Problem: The web is continuously changing.
Goal: Index and update pages in a way that keeps the index as fresh and as young as possible (given the limited resources).
Distinguish between:
- Periodic crawlers: download K pages and stop, repeat this after some time t, and replace the old collection with the new one.
- Incremental crawlers: continuously crawl the web and incrementally update the collection.
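For concreteness, a minimal sketch of the periodic variant (the helper names select_k_urls and fetch are placeholders, not from the slides; the incremental variant is developed in Section 3.3):

```python
import time

def periodic_crawler(select_k_urls, fetch, pause):
    """Sketch of a periodic crawler: download K pages, stop, wait,
    then replace the old collection wholesale with the new crawl.
    Runs forever, like a real crawler daemon."""
    collection = {}
    while True:
        new = {url: fetch(url) for url in select_k_urls()}  # crawl the K pages
        collection = new          # replace old collection with the new one
        time.sleep(pause)         # idle until the next crawl cycle
```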
3.2 Incremental Crawlers
Main goal: keep the local collection up-to-date. Two measures: freshness and age (definitions as in [6]).
Freshness of a page p_i at time t:
F(p_i; t) = 1 if p_i is up-to-date (the local copy equals the live page) at time t, and 0 otherwise.
Freshness of a local collection P = {p_1, ..., p_N} at time t:
F(P; t) = (1/N) Σ_{i=1..N} F(p_i; t)
Age of a page p_i at time t:
A(p_i; t) = 0 if p_i is up-to-date at time t, and t − (time of the first modification of p_i since the last synchronization) otherwise.
Age of a local collection P at time t:
A(P; t) = (1/N) Σ_{i=1..N} A(p_i; t)
Time average of the freshness of a page p_i:
F̄(p_i) = lim_{t→∞} (1/t) ∫₀^t F(p_i; τ) dτ
Time average of the freshness of a local collection P:
F̄(P) = lim_{t→∞} (1/t) ∫₀^t F(P; τ) dτ
(Time average of age: analogous.)
Example for Freshness and Age (source: [6])
[Figure: over time, freshness is 1 until the element is changed, drops to 0 while the local copy is stale, and jumps back to 1 when the element is synchronized; age is 0 while the copy is fresh, grows linearly from the moment of the change, and drops back to 0 at synchronization.]
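The figure's scenario can be replayed in a few lines; the function below is an illustrative transcription of the freshness and age definitions above (not code from the slides):

```python
def freshness_and_age(t, changes, syncs):
    """Evaluate F(p; t) and A(p; t) for a single page, given the times
    at which the live page changed and the times at which the crawler
    synchronized (re-downloaded) it."""
    last_sync = max((s for s in syncs if s <= t), default=0.0)
    # live-page changes that our copy (taken at last_sync) has not seen yet
    missed = [c for c in changes if last_sync < c <= t]
    if not missed:
        return 1, 0.0                  # up to date: F = 1, A = 0
    return 0, t - min(missed)          # stale since the first missed change

# The figure's scenario: the element changes at t=5, we sync at t=8.
for t in (2.0, 6.0, 7.5, 9.0):
    print(t, freshness_and_age(t, changes=[5.0], syncs=[0.0, 8.0]))
# t < 5: (1, 0.0);  5 <= t < 8: (0, t - 5);  t >= 8: (1, 0.0) again
```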
Design alternative 1: Batch mode vs. steady crawler
Batch-mode crawler: periodic update of all pages of the collection.
Steady crawler: continuous update.
[Figure: freshness over time (months) for a batch-mode crawler (sawtooth: freshness peaks after each crawl, then decays) and a steady crawler (roughly constant freshness).]
Note: Assuming changes follow a Poisson process, we can prove that the average freshness over time is identical in both cases (for the same average crawling speed!).
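A small Monte-Carlo sketch of this claim (illustrative, not from the slides): both crawlers give every page the same average sync period T, so both match the closed-form time-averaged freshness (1 − e^{−λT})/(λT):

```python
import math, random

def avg_freshness(sync_times, horizon, rate, trials=2000):
    """Monte-Carlo estimate of the time-averaged freshness of one page.
    The page changes as a Poisson process with the given rate and is
    re-downloaded at each time in sync_times (fresh at t=0)."""
    total = 0.0
    for _ in range(trials):
        fresh, last = 0.0, 0.0
        for s in sorted(sync_times) + [horizon]:
            # time from the last sync to the first change is Exp(rate)
            fresh += min(random.expovariate(rate), s - last)
            last = s
        total += fresh / horizon
    return total / trials

K, T, horizon, rate = 10, 1.0, 10.0, 2.0
# Batch mode: all K pages re-crawled together at T, 2T, ... (same schedule
# for every page, so one page is representative).
batch = avg_freshness([k * T for k in range(1, 10)], horizon, rate)
# Steady: the same per-page period T, but the K pages spread out evenly.
steady = sum(avg_freshness([k * T + i * T / K for k in range(10)],
                           horizon, rate) for i in range(K)) / K
print(batch, steady, (1 - math.exp(-rate * T)) / (rate * T))  # all ≈ 0.43
```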
Design alternative 2: In-place vs. shadowing
Replace the old version of a page with the new one either in place or via shadowing, i.e. only after all pages of one crawl have been downloaded.
Shadowing keeps two collections: the crawler's collection and the current collection.
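A minimal sketch of the shadowing idea (class and method names are illustrative):

```python
class ShadowedCollection:
    """Shadowing: the crawler writes into a shadow collection while
    queries are answered from the current one; a completed crawl is
    published by swapping the two.  An in-place crawler would write
    into `current` directly instead."""

    def __init__(self):
        self.current = {}   # collection visible to queries
        self.shadow = {}    # collection the crawler is filling

    def save(self, url, page):
        self.shadow[url] = page      # updates never disturb `current`

    def publish(self):
        self.current, self.shadow = self.shadow, {}   # atomic swap

coll = ShadowedCollection()
coll.save("http://example.org/", "<html>...</html>")
assert coll.current == {}            # queries still see the old crawl
coll.publish()
assert "http://example.org/" in coll.current
```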
Design alternative 3: Fixed vs. variable frequency
Fixed frequency / uniform refresh policy: the same access rate for all pages (independent of their actual rate of change).
Variable frequency: access pages depending on their rate of change. Example: proportional refresh policy.
Variable frequency update
Obvious assumption for a good strategy: visit a page that changes frequently more often. Wrong!!!
The optimum update strategy (if we assume Poisson-distributed changes) looks like this:
[Figure: the optimum update frequency as a function of a page's rate of change: it first rises, peaks, and then falls off again for very fast-changing pages.]
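Where this curve comes from, in one line: assuming Poisson changes with rate λ and synchronization at regular intervals of length I, a page is still fresh at time τ after a sync with probability e^{−λτ}, so its time-averaged freshness is

F̄(λ, I) = (1/I) ∫₀^I e^{−λτ} dτ = (1 − e^{−λI}) / (λI).

Maximizing the sum of these terms over all pages under a fixed total download budget produces the hump-shaped optimum above: for very fast-changing pages each download buys almost no freshness, so the optimal frequency drops again.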
Variable frequency update (cont.)
Why is this a better strategy? Illustration with a simple example: consider two pages P1 and P2, where P1 changes rarely and P2 changes very frequently, and suppose the crawler can only afford one download per time unit in total. Spending the download on P2 buys almost no freshness, because P2 will be stale again almost immediately; spending it on P1 keeps P1 fresh for a large fraction of the interval. The expected freshness gain per download is therefore higher for the slowly changing page (a numeric sketch follows below).
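A numeric sketch of the example, using the closed-form freshness from the previous slide (the change rates 1 and 9 per time unit and the budget of one download per time unit are illustrative choices, not from the slides):

```python
import math

def avg_fresh(lam, f):
    """Time-averaged freshness of a page with Poisson change rate lam,
    refreshed at frequency f (i.e. once every 1/f time units)."""
    if f <= 0:
        return 0.0
    x = lam / f
    return (1 - math.exp(-x)) / x

# P1 changes rarely (rate 1), P2 very often (rate 9); total budget: one
# download per time unit, split f1 + f2 = 1.
lam1, lam2, budget = 1.0, 9.0, 1.0
best = max(((avg_fresh(lam1, f1) + avg_fresh(lam2, budget - f1)) / 2, f1)
           for f1 in [budget * i / 100 for i in range(1, 100)])
print(best)  # ≈ (0.32, 0.99): almost the whole budget goes to the SLOW page
# For contrast, a proportional policy (f1=0.1, f2=0.9) only reaches ≈ 0.10:
print((avg_fresh(lam1, 0.1) + avg_fresh(lam2, 0.9)) / 2)
```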
Summary of the different design alternatives:
- Steady vs. batch-mode
- In-place update vs. shadowing
- Variable frequency vs. fixed frequency
3.3 Example of an Incremental Crawler
Two main goals:
- Keep the local collection fresh: regular, best-possible updates of the pages in the index.
- Continuously improve the quality of the collection: replace existing low-quality pages with new pages of higher quality.
The crawling loop in pseudocode:

WHILE (TRUE)
  URL = SELECT_TO_CRAWL(ALL_URLS)          // pick the next URL (ordering metric)
  PAGE = CRAWL(URL)                        // download the page
  IF (URL IN COLL_URLS) THEN
    UPDATE(URL, PAGE)                      // refresh the existing copy in place
  ELSE
    TMP_URL = SELECT_TO_DISCARD(COLL_URLS) // lowest-quality page (quality metric)
    DISCARD(TMP_URL)
    SAVE(URL, PAGE)
    COLL_URLS = (COLL_URLS - {TMP_URL}) ∪ {URL}
  NEW_URLS = EXTRACT_URLS(PAGE)            // follow the page's outgoing links
  ALL_URLS = ALL_URLS ∪ NEW_URLS
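A runnable toy transcription of this loop (the 50-page random web and the random stand-ins for SELECT_TO_CRAWL and SELECT_TO_DISCARD are purely illustrative; a real crawler would plug in the ordering and quality metrics from the last lecture):

```python
import random

# Illustrative toy web: 50 pages, each linking to 3 random pages.
WEB = {f"u{i}": [f"u{random.randrange(50)}" for _ in range(3)] for i in range(50)}

def crawl(url):
    return WEB.get(url, [])        # "download" a page: here, just its links

def extract_urls(page):
    return set(page)

def select_to_crawl(all_urls):     # stand-in for the ordering metric
    return random.choice(sorted(all_urls))

def select_to_discard(coll_urls):  # stand-in for the quality metric
    return random.choice(sorted(coll_urls))

def incremental_crawl(seed, max_size=10, steps=200):
    all_urls, coll_urls, collection = {seed}, set(), {}
    for _ in range(steps):                    # WHILE(TRUE), bounded for the demo
        url = select_to_crawl(all_urls)
        page = crawl(url)
        if url in coll_urls:
            collection[url] = page            # UPDATE the existing copy
        else:
            if len(coll_urls) >= max_size:    # collection full: make room
                victim = select_to_discard(coll_urls)
                coll_urls.remove(victim)      # DISCARD the worst page
                del collection[victim]
            coll_urls.add(url)                # SAVE the new page
            collection[url] = page
        all_urls |= extract_urls(page)        # add EXTRACT_URLS(PAGE)
    return collection

print(sorted(incremental_crawl("u0")))
```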
The corresponding architecture:
[Architecture diagram: the Ranking Module scans ALL_URLS and COLL_URLS and decides which URLs to add to or remove from the collection; the Update Module repeatedly pops a URL from COLL_URLS, has it crawled, compares checksums, updates/saves the page in the Collection, and pushes the URL back; the Crawl Module crawls newly added URLs and feeds extracted URLs (ADD_URLS) back into ALL_URLS.]
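A sketch of the update module's inner loop, following the labels in the diagram (fetch is a placeholder for the HTTP client, and MD5 is just one possible checksum):

```python
import hashlib
from collections import deque

def update_round(queue, fetch, collection):
    """One pass over the collection: POP a URL, re-CRAWL it, compare
    CHECKSUMs, UPDATE/SAVE the page only if it changed, and PUSH the
    URL BACK for the next round."""
    for _ in range(len(queue)):
        url = queue.popleft()                     # POP
        page = fetch(url)                         # CRAWL
        digest = hashlib.md5(page).hexdigest()    # CHECK SUM
        old_digest, _ = collection.get(url, (None, None))
        if digest != old_digest:
            collection[url] = (digest, page)      # UPDATE/SAVE
        queue.append(url)                         # PUSH BACK

# Toy usage with a fake fetcher:
pages = {"u1": b"hello", "u2": b"world"}
coll = {}
q = deque(pages)
update_round(q, lambda u: pages[u], coll)
assert all(u in coll for u in pages)
```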
References - Web Crawler
[1] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan: "Searching the Web", ACM Transactions on Internet Technology, Vol. 1/1, Aug. 2001 - Chapter 2 (Crawling web pages)
[2] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998 - Chapter 4.3 (Crawling the web)
[3] J. Cho, H. Garcia-Molina, L. Page: "Efficient Crawling Through URL Ordering", WWW 1998
[4] J. Cho, H. Garcia-Molina: "The Evolution of the Web and Implications for an Incremental Crawler", Proceedings of the 26th Intl. Conf. on Very Large Data Bases (VLDB 2000)
[5] D. Fetterly, M. Manasse, M. Najork, J. Wiener: "A Large-Scale Study of the Evolution of Web Pages", WWW 2003
[6] J. Cho, H. Garcia-Molina: "Synchronizing a Database to Improve Freshness", ACM SIGMOD 2000