1 Discussion Class 6: Crawling the Web
2 Discussion Classes
Format: Questions. Ask a member of the class to answer. Provide opportunity for others to comment.
When answering: Stand up. Give your name. Make sure that the TA hears it. Speak clearly so that all the class can hear.
Suggestions: Do not be shy about presenting partial answers. Differing viewpoints are welcome.
3 Question 1: Background
(a) When was this paper written, by whom, and why?
(b) What, if anything, has changed since this paper was written?
(c) How has Yahoo changed?
4 Question 2: Search Engine Architecture
5 Question 2: Search Engine Architecture
What is the function of the following?
(a) Crawl control
(b) Indexer module
(c) Structure index
(d) Ranking module
(e) Page repository
6 Question 3: What pages should the crawler download?
(a) What is the problem? Why do crawlers not download every page?
(b) What can a crawler know about a page without downloading it?
(c) The paper describes several importance measures: interest-driven, popularity-driven, location-driven. How do they apply?
(d) How do these importance measures interact with the ordering metrics?
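As a starting point for discussing (c) and (d), here is a minimal sketch of one possible ordering metric: visit next the unvisited URL with the most backlinks observed so far, a rough stand-in for a popularity-driven importance measure. The `Frontier` class and its method names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


class Frontier:
    """Hypothetical URL frontier ordered by backlinks seen so far."""

    def __init__(self):
        self.backlinks = defaultdict(int)  # url -> in-links observed so far
        self.visited = set()

    def add_link(self, source, target):
        # Record a link found on an already-downloaded page.
        # Links to pages that were already visited are ignored.
        if target not in self.visited:
            self.backlinks[target] += 1

    def pop_next(self):
        # Pick the unvisited URL with the most known backlinks.
        # Ties are broken alphabetically so the order is deterministic.
        if not self.backlinks:
            return None
        url = min(self.backlinks, key=lambda u: (-self.backlinks[u], u))
        del self.backlinks[url]
        self.visited.add(url)
        return url


frontier = Frontier()
frontier.add_link("http://a.example/", "http://x.example/")
frontier.add_link("http://b.example/", "http://x.example/")
frontier.add_link("http://a.example/", "http://y.example/")
first = frontier.pop_next()  # the URL with two known backlinks
```

Note that the crawler only sees backlinks from pages it has already downloaded, so the ordering metric is an estimate of the true importance measure, which is one way to frame part (d).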
7 Question 4: How should the crawler refresh pages?
(a) What is the problem?
(b) The paper discusses a "freshness" metric. What is this? Do you consider it a good metric?
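To make (b) concrete, the sketch below assumes the common definition of freshness: a local copy counts as fresh if it is identical to the live page, and the freshness of a collection is the fraction of its copies that are fresh. The function name and the dictionary representation are assumptions for illustration only.

```python
def collection_freshness(local_copies, live_pages):
    """Fraction of local copies identical to the live page.

    Both arguments are hypothetical dicts mapping URL -> page content.
    A page missing from live_pages is counted as not fresh.
    """
    if not local_copies:
        return 0.0
    fresh = sum(
        1
        for url, content in local_copies.items()
        if live_pages.get(url) == content
    )
    return fresh / len(local_copies)


local = {"a": "v1", "b": "v1", "c": "v2", "d": "v1"}
live = {"a": "v1", "b": "v2", "c": "v2", "d": "v1"}
score = collection_freshness(local, live)  # three of four copies match
```

Under this definition a copy that went stale a minute ago scores the same as one that has been stale for a year, which may be worth raising when judging whether the metric is a good one.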
8 Question 5: How should the load on the visited Web sites be minimized?
(a) Why is this a problem?
(b) What can a crawler do to minimize the problem?
(c) What can a web site do to minimize the problem?
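One mechanism relevant to both (b) and (c) is the robots exclusion convention: a site publishes a robots.txt file and a well-behaved crawler consults it before fetching. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` user agent, and the `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_policy(robots_text):
    # Parse robots.txt text that the crawler has already downloaded.
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)

# The crawler checks each URL before requesting it.
allowed_public = policy.can_fetch("ExampleBot", "http://example.com/public/page")
allowed_private = policy.can_fetch("ExampleBot", "http://example.com/private/page")

# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = policy.crawl_delay("ExampleBot")
```

Compliance is voluntary: the file only states the site's wishes, so it reduces load only for crawlers that choose to honor it.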
9 Question 6: How should the crawling process be parallelized?
(a) Why should the crawling process be parallelized?
(b) What are the principal options?
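One option that could come up under (b) is to partition the URL space statically, so that each crawler process owns a fixed set of hosts. The sketch below assigns a URL to a crawler by hashing its hostname; the function name and the choice of SHA-256 are assumptions, and this is only one of several possible schemes, not necessarily the one the paper favors.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index in [0, num_crawlers) by its hostname."""
    host = urlsplit(url).hostname or ""
    # A cryptographic hash gives the same result on every machine and run,
    # unlike Python's built-in hash(), which is randomized per process.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


# Two pages on the same (made-up) host always go to the same crawler.
same_owner = assign_crawler("http://example.com/a", 4) == assign_crawler(
    "http://example.com/b?x=1", 4
)
```

Keeping all pages of a host with one crawler makes it easier to rate-limit requests to that host, at the cost of uneven work when a few hosts are very large.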