Freshness Policy
Binoy Dharia, K. Rohan Gandhi, Madhura Kolwadkar
Department of Computer Science, University of Southern California, Los Angeles, CA


Freshness Policy
A freshness policy, also known as a revisit policy, determines the order and timing in which a crawler re-crawls web pages. By the time a web crawler finishes a crawl, pages may have been created, updated, or deleted, leaving the crawled data out of date. To show users the latest results, a search engine needs an efficient revisit policy. Such a policy saves time and bandwidth while keeping the search engine's data up to date.

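Metrics for Evaluation of Freshness Policy
Two metrics describe how up to date the crawled copy of a page is.
Freshness: a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:
\[ F_p(t) = \begin{cases} 1 & \text{if } p \text{ is up to date at time } t \\ 0 & \text{otherwise} \end{cases} \]
Age: a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:
\[ A_p(t) = \begin{cases} 0 & \text{if } p \text{ is up to date at time } t \\ t - (\text{modification time of } p) & \text{otherwise} \end{cases} \]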
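The two metrics can be computed per page from three timestamps. The sketch below is a minimal illustration, assuming the commonly used definitions: freshness is 1 when the cached copy still matches the live page and 0 otherwise, and age is the time elapsed since the live page changed. The function names, parameter names, and dates are hypothetical and do not come from the slides.

```python
from datetime import datetime, timedelta
from typing import Optional


def freshness(last_crawled: datetime, last_modified: Optional[datetime]) -> int:
    """Return 1 if the cached copy still matches the live page, else 0.

    `last_modified` is the first time the live page changed after the crawl,
    or None if no change is known (assumption of this sketch).
    """
    if last_modified is None or last_modified <= last_crawled:
        return 1
    return 0


def age(now: datetime, last_crawled: datetime,
        last_modified: Optional[datetime]) -> timedelta:
    """Return how long the cached copy has been out of date at time `now`."""
    if freshness(last_crawled, last_modified) == 1:
        return timedelta(0)
    return now - last_modified


if __name__ == "__main__":
    crawl = datetime(2024, 4, 1, 8, 0)      # hypothetical crawl time
    modified = datetime(2024, 4, 1, 20, 0)  # page changed 12 hours later
    now = datetime(2024, 4, 3, 8, 0)
    print(freshness(crawl, modified), age(now, crawl, modified))
```

A fresh copy always has age zero, so the two metrics only diverge in how they weight stale copies: freshness treats all stale copies equally, while age grows the longer a change goes unnoticed.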

Methodology
We tracked over 90 sites for a period of 2 weeks, divided into 4 categories: Movies, Technology, Education, and News.
- Selected sites based on Alexa traffic rankings (Rohan and Binoy)
- Developed a crawler in Java to download the original page as well as the Google and Bing cached versions of each web page twice a day (Binoy and Rohan)
- Implemented our own code to extract the date and time from the cache for each web page (Rohan)
- Implemented our own diff functionality to detect changes in a web page over time; it ignores HTML tags and scripts and compares only the data between the tags (Madhura)
- Data integration (Madhura)
- Data analysis (Binoy, Rohan and Madhura)
- Study of the Nutch adaptive fetch policy (Binoy)
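The change detection step described above (ignore HTML tags and scripts, compare only the data between tags) could look roughly like the following. This is an illustrative Python sketch, not the authors' Java implementation; the class and function names are hypothetical, and skipping `<style>` bodies and collapsing whitespace are extra assumptions of this sketch.

```python
from html.parser import HTMLParser
from typing import List


class VisibleTextExtractor(HTMLParser):
    """Collect text between tags, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}  # skipping <style> is an assumption

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def visible_text(html: str) -> List[str]:
    """Return the non-empty text chunks found between tags."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def page_changed(old_html: str, new_html: str) -> bool:
    """Report a change only when the text between tags differs."""
    return visible_text(old_html) != visible_text(new_html)
```

With this approach, a page whose markup attributes or embedded scripts change between two downloads is still treated as unchanged, which avoids counting rotating ad scripts or session tokens as content updates.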

NUTCH 1.2 Setup
Installed Nutch with Lucene on a local machine for crawling.
Settings used for Nutch crawling:
  db.fetch.interval.default              172800
  db.fetch.schedule.class                org.apache.nutch.crawl.AdaptiveFetchSchedule
  db.fetch.schedule.adaptive.inc_rate    0.4
  db.fetch.schedule.adaptive.dec_rate    0.2
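To see what these settings imply, the core of an adaptive schedule can be modeled in a few lines: the re-fetch interval shrinks by dec_rate when a page was found modified and grows by inc_rate when it was not. The sketch below is a simplified Python model using the values listed above, not Nutch's actual code. It leaves out other features of AdaptiveFetchSchedule, and the minimum and maximum interval bounds are assumed values that do not appear in the slides.

```python
DEFAULT_INTERVAL = 172800.0  # db.fetch.interval.default, in seconds (2 days)
INC_RATE = 0.4               # db.fetch.schedule.adaptive.inc_rate
DEC_RATE = 0.2               # db.fetch.schedule.adaptive.dec_rate
MIN_INTERVAL = 60.0          # assumed lower bound, not given in the slides
MAX_INTERVAL = 31536000.0    # assumed upper bound (365 days), not given in the slides


def next_interval(interval: float, modified: bool) -> float:
    """Return the next re-fetch interval in seconds.

    A modified page is revisited sooner; an unmodified page is revisited later.
    """
    if modified:
        interval *= 1.0 - DEC_RATE
    else:
        interval *= 1.0 + INC_RATE
    # Keep the interval inside the assumed bounds.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))


if __name__ == "__main__":
    interval = DEFAULT_INTERVAL
    for modified in (True, True, False):  # hypothetical fetch history
        interval = next_interval(interval, modified)
        print(f"modified={modified}: next fetch in {interval / 86400:.2f} days")
```

Under this model, a page that changes on every visit sees its interval fall by 20% per fetch, while a static page sees it rise by 40% per fetch.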

Nutch Crawling Snapshot
Average freshness achieved with the Nutch fetch policy: 0.5

Data Integration and Calculations (Excel)
[Figure: data snippet after integration]
Age and freshness calculations
Per site:
  Average Age per Site = (Sum of Ages) / (Number of Crawls)
  Average Freshness per Site = (Sum of Freshness) / (Number of Crawls)
Per category:
  Average Age per Category = (Sum of Average Site Ages) / (Site Count)
  Average Freshness per Category = (Sum of Average Site Freshness) / (Site Count)
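The per-site and per-category averages above were computed in Excel; the same two-level aggregation can be sketched in Python as follows. The input layout (one tuple per crawl, plus a site-to-category mapping), the function names, and the sample values are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def site_averages(crawls):
    """Average age and freshness per site.

    `crawls` is an iterable of (site, age_in_days, freshness) tuples,
    one tuple per crawl observation.
    """
    ages = defaultdict(list)
    fresh = defaultdict(list)
    for site, age_days, is_fresh in crawls:
        ages[site].append(age_days)
        fresh[site].append(is_fresh)
    return {site: (mean(ages[site]), mean(fresh[site])) for site in ages}


def category_averages(per_site, site_category):
    """Average the per-site averages within each category."""
    grouped = defaultdict(list)
    for site, averages in per_site.items():
        grouped[site_category[site]].append(averages)
    return {
        category: (mean([a for a, _ in rows]), mean([f for _, f in rows]))
        for category, rows in grouped.items()
    }


if __name__ == "__main__":
    # Hypothetical observations, not data from the study.
    crawls = [("a.example", 1.0, 1), ("a.example", 3.0, 0),
              ("b.example", 2.0, 1), ("b.example", 4.0, 1)]
    categories = {"a.example": "News", "b.example": "News"}
    print(category_averages(site_averages(crawls), categories))
```

Averaging per site first and then per category gives every site equal weight in its category, regardless of how many crawls were recorded for it.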

Standard Deviation
Standard Deviation in Age for a Category (Days) = sqrt [ (Sum of Squares of Age Differences) / (Site Count) ]
Category      Bing Std Dev    Google Std Dev
Education     4.019501547     1.115546433
Movies        8.308207969     1.137753946
News          2.30323971      1.335194148
Technology    4.482674018     1.08891161
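The standard deviation formula above divides by the site count, which corresponds to a population standard deviation. A minimal sketch follows, assuming that "age difference" means each site's average age minus the category's average age; the slides do not spell this out, and the function name and sample values are hypothetical.

```python
from math import sqrt
from typing import Sequence


def category_age_std_dev(site_average_ages: Sequence[float]) -> float:
    """Population standard deviation of per-site average ages, in days."""
    site_count = len(site_average_ages)
    category_mean = sum(site_average_ages) / site_count
    squared_differences = sum(
        (site_age - category_mean) ** 2 for site_age in site_average_ages
    )
    return sqrt(squared_differences / site_count)


if __name__ == "__main__":
    # Hypothetical per-site average ages for one category, in days.
    print(category_age_std_dev([1.0, 3.0]))
```

A low value means the sites in a category are crawled at similar intervals; a high value means crawl frequency varies widely from site to site within the category.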

Data Analysis: Age Comparison between Google and Bing
Conclusions:
- Google's database is much more up to date than Bing's
- Google crawls news sites more than once a day
- Google's crawling cycle is mostly consistent across categories
- Google's average crawling cycle is 0.8 days
- Bing's average crawling cycle is 4.6 days

Data Analysis: Freshness Comparison between Google and Bing
Conclusions:
- News sites change frequently, so even though the age for news sites is low, the cached page is usually not fresh
- Google's average freshness is 0.65
- Bing's average freshness is 0.28

Data Analysis: Comparison of Standard Deviation across Domains
Conclusions:
- Google's standard deviation is low, which indicates that the category of a site is not a major factor in deciding crawl frequency
- The same inference does not apply to Bing

Data Analysis: Alexa Rank (x-axis) vs Google Cache Age (y-axis)
Conclusion: Google crawls sites with high traffic more frequently.

Data Analysis: Alexa Rank (x-axis) vs Bing Cache Age (y-axis)
Conclusion: Bing crawling is uniform across sites with varying traffic volume.

Data Analysis: Date Modified vs Crawl Date (high-ranking sites)
Conclusion: Google's crawling appears to adapt to changes on the original site, while Bing's crawling is uniform for high-ranking sites.

Data Analysis: Date Modified vs Crawl Date (low-ranking sites)
Conclusion: Both Google's and Bing's crawling appear to be uniform for low-ranking sites.

Conclusions
Google freshness policy, factors identified:
- Popularity/traffic volume is considered
- Category is not considered
- Frequency of change of a page affects the crawling cycle (adaptive)
Bing freshness policy, factors identified:
- Site popularity is not considered
- Category is considered
- Frequency of change of a page affects the crawling cycle (adaptive)

Limitations and Future Work
Limitations:
- Conclusions are drawn from a limited random data sample because of:
  - Crawling restrictions on Google cached data
  - Bing cached links changing every time Bing's cached repository is updated
- A larger time frame is required to identify the crawling behavior of each search engine
- High freshness was observed for Nutch because the crawling interval was low
Future work:
- Additional factors, such as the number of incoming and outgoing links, can be recorded and their correlation with crawling observed
- Factors like ranking, popularity, and number of outgoing links can be incorporated into the Nutch adaptive fetch policy
