CS246: Web Crawling
Junghoo “John” Cho, UCLA
What is a Crawler?
[Diagram: the basic crawl loop. Seed the “to visit urls” queue with initial URLs; get the next URL; get the page from the Web; extract URLs from the page back into the queue; record visited URLs and store the downloaded Web pages.]
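Below is a minimal Python sketch of the crawl loop in the diagram. The queue structure, the naive href extraction, and the max_pages cutoff are illustrative assumptions, not details given in the slides.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(initial_urls, max_pages=100):
    to_visit = deque(initial_urls)   # "to visit urls", seeded with the initial URLs
    visited = set()                  # "visited urls"
    pages = {}                       # downloaded "web pages"

    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # "get next url"
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # "get page"
        except (OSError, ValueError):
            continue                 # skip pages that fail to download
        pages[url] = html
        # "extract urls": naive href extraction; a real crawler would use an HTML parser
        for link in re.findall(r'href="([^"]+)"', html):
            to_visit.append(urljoin(url, link))
    return pages
```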
Challenges
Q: The process seems straightforward. Is anything difficult? Is it just a matter of implementation? What are the issues?
Crawling Issues
Load at the site: the crawler should be unobtrusive to the sites it visits (a per-host delay sketch follows this list)
Load at the crawler: download billions of Web pages in a short time
Page selection: many pages, limited resources
Page refresh: refresh pages incrementally, not in batch
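One common way to keep the load at a site low is to enforce a minimum delay between requests to the same host. The sketch below is an illustrative assumption about how that could look; the delay value and the host-keyed bookkeeping are not from the slides.

```python
import time
from urllib.parse import urlparse

class PolitenessLimiter:
    """Enforce a minimum delay between consecutive requests to the same host."""
    def __init__(self, min_delay_sec=2.0):
        self.min_delay = min_delay_sec
        self.last_access = {}   # host -> time of the last request to that host

    def wait(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        earliest = self.last_access.get(host, 0.0) + self.min_delay
        if now < earliest:
            time.sleep(earliest - now)   # back off until the host's delay has passed
        self.last_access[host] = time.monotonic()
```

In the crawl loop above, the crawler would call `limiter.wait(url)` once per URL, just before fetching the page.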
Page Refresh
Q: How can we maintain “cached” pages “fresh”?
The technique can be useful for web search engines, data warehouses, etc.
[Diagram: a Source page refreshed into a local Copy]
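One concrete way to refresh a local copy from its source is an HTTP conditional GET: re-download only if the source reports a change. The sketch below is an illustrative assumption about that mechanism; the local-store format and the refresh helper are not part of the slides.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def refresh(url, copy):
    """Refresh a local copy, where copy is {"body": bytes, "last_modified": str or None}."""
    headers = {}
    if copy.get("last_modified"):
        headers["If-Modified-Since"] = copy["last_modified"]
    try:
        resp = urlopen(Request(url, headers=headers), timeout=10)
    except HTTPError as e:
        if e.code == 304:          # source unchanged; the cached copy is still fresh
            return copy
        raise
    copy["body"] = resp.read()                                # source changed: update the copy
    copy["last_modified"] = resp.headers.get("Last-Modified")
    return copy
```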
Other Caching Problems
Disk buffers: disk page vs. in-memory page buffer
Memory hierarchy: 1st-level cache, 2nd-level cache, …
Q: Is Web caching any different?
Main Difference
Origination of changes: in traditional caching, changes flow from the cache to the source; in Web crawling, they flow from the source to the cache
Freshness requirement: traditional caching demands perfect consistency; a Web copy can tolerate stale caching
Role of a cache: a traditional cache is transient space governed by a cache replacement policy; the crawled copy is the main data source for the application, so the concern is refresh delay
Main Difference (continued)
Limited refresh resources: many independent sources, network bandwidth, computational resources, …
Mainly a pull model
Ideas?
Q: How can we maintain pages “fresh”? What ideas can we explore to “refresh” pages well?
Idea 1: Some pages change often, some pages do not (a news archive vs. a daily news article). Q: Can we do things differently depending on how often they change? (A naive scheduling sketch follows below.)
Idea 2: A set of pages change together (e.g., Java manual pages). Q: Can we do something when we notice that some pages change together?
Q: How can we formalize these ideas as a computational problem?
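As one naive instantiation of Idea 1, the sketch below schedules refreshes in proportion to each page's estimated change rate. The change-rate estimates, the planning horizon, and the proportional policy are illustrative assumptions; proportional refresh is not claimed here to be the optimal policy, only a starting point for formalizing the idea.

```python
import heapq

def schedule_refreshes(change_rates, horizon_days=7.0):
    """change_rates: {url: estimated changes per day}.
    Returns (time_in_days, url) refresh events, visiting each page roughly
    once per expected change within the horizon."""
    events = []
    for url, rate in change_rates.items():
        interval = 1.0 / max(rate, 1e-6)          # expected days between changes
        t = interval
        while t <= horizon_days:
            heapq.heappush(events, (t, url))      # one refresh per expected change
            t += interval
    return [heapq.heappop(events) for _ in range(len(events))]

# Example: a daily news page gets frequent refreshes; an archive page that
# rarely changes may get none within the horizon.
plan = schedule_refreshes({"news.example.com/today": 1.0,
                           "archive.example.com/2001": 0.05})
```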