CS246: Web Crawling
Junghoo “John” Cho, UCLA
What is a Crawler?
[Diagram: the basic crawl loop. Seed the “to visit urls” queue with initial URLs; get the next URL; get the page from the Web; extract URLs from the page back into the queue; record visited URLs and store the downloaded Web pages.]
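Below is a minimal Python sketch of the crawl loop in the diagram. The queue structure, the naive href extraction, and the max_pages cutoff are illustrative assumptions, not details given in the slides.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(initial_urls, max_pages=100):
    to_visit = deque(initial_urls)   # "to visit urls", seeded with the initial URLs
    visited = set()                  # "visited urls"
    pages = {}                       # downloaded "web pages"

    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()     # "get next url"
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")  # "get page"
        except (OSError, ValueError):
            continue                 # skip pages that fail to download
        pages[url] = html
        # "extract urls": naive href extraction; a real crawler would use an HTML parser
        for link in re.findall(r'href="([^"]+)"', html):
            to_visit.append(urljoin(url, link))
    return pages
```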
Challenges
Q: The process seems straightforward. Is anything difficult? Is it just a matter of implementation? What are the issues?
Crawling Issues
Load at the site: the crawler should be unobtrusive to the sites it visits (a per-host delay sketch follows this list)
Load at the crawler: download billions of Web pages in a short time
Page selection: many pages, limited resources
Page refresh: refresh pages incrementally, not in batch
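One common way to keep the load at a site low is to enforce a minimum delay between requests to the same host. The sketch below is an illustrative assumption about how that could look; the delay value and the host-keyed bookkeeping are not from the slides.

```python
import time
from urllib.parse import urlparse

class PolitenessLimiter:
    """Enforce a minimum delay between consecutive requests to the same host."""
    def __init__(self, min_delay_sec=2.0):
        self.min_delay = min_delay_sec
        self.last_access = {}   # host -> time of the last request to that host

    def wait(self, url):
        host = urlparse(url).netloc
        now = time.monotonic()
        earliest = self.last_access.get(host, 0.0) + self.min_delay
        if now < earliest:
            time.sleep(earliest - now)   # back off until the host's delay has passed
        self.last_access[host] = time.monotonic()
```

In the crawl loop above, the crawler would call `limiter.wait(url)` once per URL, just before fetching the page.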
Page Refresh
Q: How can we maintain “cached” pages “fresh”?
The technique can be useful for web search engines, data warehouses, etc.
[Diagram: a Source page refreshed into a local Copy]
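One concrete way to refresh a local copy from its source is an HTTP conditional GET: re-download only if the source reports a change. The sketch below is an illustrative assumption about that mechanism; the local-store format and the refresh helper are not part of the slides.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def refresh(url, copy):
    """Refresh a local copy, where copy is {"body": bytes, "last_modified": str or None}."""
    headers = {}
    if copy.get("last_modified"):
        headers["If-Modified-Since"] = copy["last_modified"]
    try:
        resp = urlopen(Request(url, headers=headers), timeout=10)
    except HTTPError as e:
        if e.code == 304:          # source unchanged; the cached copy is still fresh
            return copy
        raise
    copy["body"] = resp.read()                                # source changed: update the copy
    copy["last_modified"] = resp.headers.get("Last-Modified")
    return copy
```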
Other Caching Problems
Disk buffers: disk page vs. in-memory page buffer
Memory hierarchy: 1st-level cache, 2nd-level cache, …
Q: Is Web caching any different?
Main Difference
Origination of changes: in traditional caching, changes flow from the cache to the source; in Web crawling, they flow from the source to the cache
Freshness requirement: traditional caching demands perfect consistency; a Web copy can tolerate stale caching
Role of a cache: a traditional cache is transient space governed by a cache replacement policy; the crawled copy is the main data source for the application, so the concern is refresh delay
Main Difference (continued)
Limited refresh resources: many independent sources, network bandwidth, computational resources, …
Mainly a pull model
Ideas?
Q: How can we maintain pages “fresh”? What ideas can we explore to “refresh” pages well?
Idea 1: Some pages change often, some pages do not (a news archive vs. a daily news article). Q: Can we do things differently depending on how often they change? (A naive scheduling sketch follows below.)
Idea 2: A set of pages change together (e.g., Java manual pages). Q: Can we do something when we notice that some pages change together?
Q: How can we formalize these ideas as a computational problem?
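As one naive instantiation of Idea 1, the sketch below schedules refreshes in proportion to each page's estimated change rate. The change-rate estimates, the planning horizon, and the proportional policy are illustrative assumptions; proportional refresh is not claimed here to be the optimal policy, only a starting point for formalizing the idea.

```python
import heapq

def schedule_refreshes(change_rates, horizon_days=7.0):
    """change_rates: {url: estimated changes per day}.
    Returns (time_in_days, url) refresh events, visiting each page roughly
    once per expected change within the horizon."""
    events = []
    for url, rate in change_rates.items():
        interval = 1.0 / max(rate, 1e-6)          # expected days between changes
        t = interval
        while t <= horizon_days:
            heapq.heappush(events, (t, url))      # one refresh per expected change
            t += interval
    return [heapq.heappop(events) for _ in range(len(events))]

# Example: a daily news page gets frequent refreshes; an archive page that
# rarely changes may get none within the horizon.
plan = schedule_refreshes({"news.example.com/today": 1.0,
                           "archive.example.com/2001": 0.05})
```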