Augmenting Focused Crawling using Search Engine Queries Wang Xuan 10th Nov 2006.

What is focused crawling
– Crawling vs. focused crawling
– Seed page
– Target page

Crawling methods
Web search algorithms:
– Breadth-first (used in standard crawling)
– Best-first (used in focused crawling)
– Both are local-search strategies
Web analysis algorithms:
– Content-based web analysis: page text, title, URL, page layout
– Link-based web analysis: it is hard to analyze a page while the search graph is not yet completely known
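The difference between the two search strategies can be illustrated with a minimal sketch. The toy link graph, the relevance scores, and the function names below are hypothetical and are not taken from the talk; the only point is that breadth-first uses a FIFO frontier while best-first uses a priority queue ordered by an estimated relevance score.

```python
import heapq
from collections import deque


def breadth_first_order(graph, seed):
    """Visit pages in FIFO order, as a standard crawler would."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def best_first_order(graph, seed, score):
    """Visit the highest-scoring known page first, as a focused crawler would."""
    seen, order = {seed}, []
    # heapq is a min-heap, so scores are negated.
    # The counter breaks ties in insertion order.
    counter = 0
    heap = [(-score(seed), counter, seed)]
    while heap:
        _, _, page = heapq.heappop(heap)
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(heap, (-score(link), counter, link))
    return order


# Hypothetical toy link graph and relevance scores (assumptions for illustration).
toy_graph = {"seed": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
toy_scores = {"seed": 0.5, "a": 0.1, "b": 0.9, "c": 0.2, "d": 0.8}

bfs = breadth_first_order(toy_graph, "seed")
best = best_first_order(toy_graph, "seed", toy_scores.get)
print(bfs)
print(best)
```

In this toy graph the best-first crawler follows the high-scoring branch (b, then d) before returning to the low-scoring page a, whereas breadth-first visits pages level by level regardless of score.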

Related work
– Naïve Bayes crawler: the relevance score is the cosine similarity between page and topic
– IBM focused crawler: introduces a distiller to find topic hubs
– CORA crawler: assigns a Q-value according to the number of target pages in the neighborhood
– Context focused crawler: introduces a link hierarchy
– Automatic Publication Data Gatherer: classifies the web page without the page
– PaSE: locates publications using a search engine
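A cosine-similarity relevance score of the kind mentioned for the first crawler can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any of the systems listed: it uses raw term counts over whitespace-separated tokens, and the helper names and sample strings are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (assumption: lowercase, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty page or topic has no defined direction
    return dot / (norm_a * norm_b)


# Hypothetical topic description and page texts.
topic = term_vector("focused crawling search engine")
on_topic = term_vector("focused crawling with a search engine")
off_topic = term_vector("cooking recipes for dinner")

print(cosine_similarity(topic, on_topic))
print(cosine_similarity(topic, off_topic))
```

A page sharing no terms with the topic scores 0, and a page identical to the topic description scores 1, so the score can be used directly to rank URLs in a best-first frontier.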

General framework
Components: repository, page fetch unit, URL filter, URL extractor, frontier, classifier, feature extractor, term extraction module
– Highly dependent on the seed pages

Baseline system

Three stages of the crawling

Framework for upgraded system

Diagram labels: Target URL, Search Engine, More Pages, Term Extraction

                          Baseline system   Upgraded system
Publication pages found   45                117
Precision                 3.21%             8.36%
Recall                    26.63%            69.23%
F1                        0.057             0.149
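As a consistency check, the F1 values follow from the reported precision and recall through the standard harmonic-mean formula F1 = 2PR / (P + R). The sketch below recomputes them; the function name is hypothetical, and the total number of publication pages is derived here from the reported figures rather than stated in the talk.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the two systems.
baseline_f1 = f1_score(0.0321, 0.2663)
upgraded_f1 = f1_score(0.0836, 0.6923)

# Derived, not reported: pages found / recall gives the implied number
# of publication pages in the evaluation set.
baseline_total = round(45 / 0.2663)
upgraded_total = round(117 / 0.6923)

print(round(baseline_f1, 3), round(upgraded_f1, 3))
print(baseline_total, upgraded_total)
```

Both systems imply the same total of about 169 publication pages, which suggests the two rows were evaluated against the same target set.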
