Published by Job Charles. Modified over 9 years ago.
1
Augmenting Focused Crawling using Search Engine Queries
Wang Xuan
10th Nov 2006
2
What is focused crawling?
Crawling vs. focused crawling
[Figure: seed page, target page]
3
Crawling methods
Web search algorithm:
– Breadth-first (used in standard crawling)
– Best-first (used in focused crawling)
– Both are local-search strategies
Web analysis algorithm:
– Content-based web analysis: page text, title, URL, page layout
– Link-based web analysis: hard to apply during the crawl, because the search graph is not yet completely known
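The two search strategies named on this slide can be contrasted with a minimal sketch. The link graph, the relevance scores, and the function names below are hypothetical and chosen only for illustration; in a real focused crawler the score of an unvisited URL would be an estimate produced by a classifier, not a known value.

```python
import heapq
from collections import deque

# Hypothetical toy link graph and estimated relevance scores (not from the slides).
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["target"],
    "c": [],
    "target": [],
}
SCORE = {"seed": 0.5, "a": 0.1, "b": 0.9, "c": 0.2, "target": 1.0}


def breadth_first(start):
    """Visit pages in the order they were discovered (FIFO frontier)."""
    order, seen, frontier = [], {start}, deque([start])
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


def best_first(start):
    """Always visit the highest-scored page in the frontier (priority queue)."""
    order, seen = [], {start}
    frontier = [(-SCORE[start], start)]  # negate: heapq is a min-heap
    while frontier:
        _, page = heapq.heappop(frontier)
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-SCORE[link], link))
    return order


print(breadth_first("seed"))
print(best_first("seed"))
```

On this toy graph the best-first crawl reaches the target page third, while the breadth-first crawl reaches it last. Both only ever look at links reachable from pages already fetched, which is why the slide calls them local-search strategies.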
4
Related works
– Naïve Bayes Crawler: the relevance score is the cosine similarity between page and topic
– IBM focused crawler: introduces a distiller to find topic hubs
– CORA crawler: assigns a Q-value according to the number of target pages in the neighborhood
– Context focused crawler: introduces a link hierarchy
– Automatic Publication Data Gatherer: classifies the web page without the page
– PaSE: locates publications using a search engine
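The cosine-similarity relevance score mentioned above can be sketched as follows. This is a minimal illustration assuming raw term-frequency vectors and whitespace tokenization; the slides do not say how any of the cited crawlers weight or tokenize terms, and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0  # empty text: define similarity as 0


topic = "focused crawling publication pages"
page = "list of publication pages on focused crawling"
print(cosine_similarity(page, topic))
```

A score near 1 means the page uses the same vocabulary as the topic description; a score of 0 means they share no terms.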
5
General framework
[Figure: repository, page fetch unit, URL filter, URL extractor, frontier, classifier, feature extractor, term extraction module]
Highly dependent on the seed pages
6
Baseline system
7
Three stages of the crawling
8
Framework for upgraded system
9
[Figure: target URL, search engine, more pages, term extraction]
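One way to read this diagram is that terms extracted from already-found target pages are turned into search engine queries, whose results supply more pages to crawl. The sketch below covers only the term-extraction and query-building step, under loud assumptions: the slides do not specify the extraction method, so simple frequency counting with a small stopword list is used here, and `extract_terms`, `build_query`, and `STOPWORDS` are hypothetical names. No search engine is actually called.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "for", "on", "is"}


def extract_terms(pages, k=3):
    """Return the k most frequent non-stopword terms across the given page texts."""
    counts = Counter()
    for text in pages:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def build_query(terms):
    """Join the extracted terms into a plain keyword query string."""
    return " ".join(terms)


target_pages = [
    "Publications of the crawling group",
    "List of publications on focused crawling",
]
print(build_query(extract_terms(target_pages)))
```

The resulting query string would then be submitted to a search engine, and the returned URLs added to the frontier; that step depends on the engine used and is left out here.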
11
                          Baseline system   Upgraded system
Publication pages found   45                117
Precision                 3.21%             8.36%
Recall                    26.63%            69.23%
F1                        0.057             0.149
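The F1 values in this table follow from the reported precision and recall, since F1 is their harmonic mean. A short check, with a hypothetical helper name:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the table, converted from percentages.
print(round(f1_score(0.0321, 0.2663), 3))  # baseline system
print(round(f1_score(0.0836, 0.6923), 3))  # upgraded system
```

Both rounded results agree with the table (0.057 and 0.149). The recall figures are also mutually consistent: 45 / 0.2663 and 117 / 0.6923 both give about 169 publication pages in total.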