CS246: Web Crawling
Junghoo “John” Cho, UCLA
What is a Crawler?
- Start with a set of initial URLs, which initialize the “to visit” URL queue
- Repeat: get the next URL from the queue, get the page from the Web, and extract URLs from it
- Extracted URLs are added to the “to visit” queue; fetched URLs are recorded as “visited”, and the downloaded Web pages are stored
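The loop in this diagram can be written down directly. Below is a minimal sketch in Python, assuming a placeholder seed list and using only the standard library; it illustrates the flow above and is not the course’s reference crawler.

```python
# Minimal crawl-loop sketch (illustration only). SEED_URLS is a placeholder.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

SEED_URLS = ["https://example.com/"]          # "initial urls"

class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags ("extract urls")."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    to_visit = deque(seeds)                   # "to visit urls", seeded with "initial urls"
    visited = set()                           # "visited urls"
    pages = {}                                # downloaded "web pages"
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()              # "get next url"
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # "get page"
        except OSError:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:          # feed extracted URLs back into the queue
            to_visit.append(urljoin(url, link))
    return pages
```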
Challenges
- Q: The process seems straightforward. Is anything actually difficult, or is it just a matter of implementation? What are the issues?
Crawling Issues
- Load at the site: the crawler should be unobtrusive to the sites it visits (a per-host politeness sketch follows this list)
- Load at the crawler: billions of Web pages must be downloaded in a short time
- Page selection: many pages, limited resources
- Page refresh: refresh pages incrementally, not in batch
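One common way to keep the load on a visited site low is to enforce a minimum delay between successive requests to the same host. The sketch below assumes a fixed per-host delay; the 2-second value and the class name are illustrative choices, not parameters prescribed by the course.

```python
# Per-host politeness sketch: never hit the same host more often than min_delay.
import time
from urllib.parse import urlparse

class PolitenessLimiter:
    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay            # illustrative 2-second per-host gap
        self.last_hit = {}                    # host -> time of the last request

    def wait(self, url):
        host = urlparse(url).netloc
        earliest = self.last_hit.get(host, 0.0) + self.min_delay
        now = time.monotonic()
        if now < earliest:
            time.sleep(earliest - now)        # back off until the host's delay expires
        self.last_hit[host] = time.monotonic()
```

In the crawl loop sketched earlier, `limiter.wait(url)` would be called just before each page is fetched.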
Page Refresh
- How can we keep “cached” copies of pages “fresh”?
- The technique is useful for Web search engines, data warehouses, etc.
- [Diagram: a local copy is refreshed from the source]
Other Caching Problems
- Disk buffers: disk page, memory page buffer
- Memory hierarchy: 1st-level cache, 2nd-level cache, …
- Q: Is Web caching any different?
Main Difference
- Origination of changes: cache to source vs. source to cache
- Freshness requirement: perfect caching vs. stale caching
- Role of a cache: transient space (cache-replacement policy) vs. main data source for the application
- Refresh delay
Main Difference (cont’d)
- Limited refresh resources: network bandwidth, computational resources, …
- Many independent sources
- Mainly pull model
Ideas?
- Q: How can we maintain pages “fresh”? What ideas can we explore to “refresh” pages well?
- Idea 1: Some pages change often, some pages do not (e.g., a news archive vs. a daily news article). Q: Can we do things differently depending on how often they change?
- Idea 2: A set of pages change together (e.g., the Java manual pages). Q: Can we do something when we notice that some pages change together?
- Q: How can we formalize these ideas as a computational problem? (A naive sketch of Idea 1 follows.)
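A naive way to act on Idea 1 is to estimate how often each page changes from past visits and spend more of a fixed refresh budget on the pages that change most. The sketch below only illustrates that intuition and is not the refresh policy developed in the course; the checksum-based change test and the “highest estimated change rate first” rule are assumptions made for the example.

```python
# Naive Idea-1 sketch: track observed changes per page and refresh the pages
# with the highest estimated change rate first (illustration only).
import hashlib

class RefreshScheduler:
    def __init__(self):
        self.checksums = {}   # url -> checksum of the last downloaded content
        self.visits = {}      # url -> number of refresh visits so far
        self.changes = {}     # url -> number of visits where the content had changed

    def record_visit(self, url, content):
        digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
        self.visits[url] = self.visits.get(url, 0) + 1
        if url in self.checksums and self.checksums[url] != digest:
            self.changes[url] = self.changes.get(url, 0) + 1
        self.checksums[url] = digest

    def change_rate(self, url):
        # Fraction of past visits on which the page had changed (crude estimate).
        visits = self.visits.get(url, 0)
        return self.changes.get(url, 0) / visits if visits else 1.0  # unseen pages first

    def next_to_refresh(self, urls, budget):
        # Spend the refresh budget on the pages estimated to change most often.
        return sorted(urls, key=self.change_rate, reverse=True)[:budget]
```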