Publication Spider. Wang Xuan, 07/14/2006. What is a publication spider? Gathering publication pages using focused crawling, with the help of a search engine.

Presentation transcript:

1 Publication Spider Wang Xuan 07/14/2006

2 What is a publication spider?
– Gathers publication pages
– Uses focused crawling
– With the help of a search engine

3 What is a publication spider?
– Gathers publication pages
– Uses focused crawling
– With the help of a search engine

4 What is focused crawling?
Crawling vs. focused crawling

5 Crawling methods
Web search algorithm:
– Breadth-first (used in standard crawling)
– Best-first (used in focused crawling)
– Both are local-search strategies
Web analysis algorithm:
– Content-based web analysis: page text, title, URI, page layout
– Link-based web analysis: hard to apply while the search graph is not yet completely known
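As a rough illustration of the two search strategies, the sketch below visits the same toy link graph in breadth-first and in best-first order. Everything in it is hypothetical and not taken from the slides: the function names, the `GRAPH` dictionary, and the `SCORES` table standing in for a relevance estimate.

```python
import heapq
from collections import deque

# Hypothetical toy link graph: page -> outgoing links.
GRAPH = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}
# Hypothetical relevance estimates; a real crawler would compute these.
SCORES = {"seed": 0.0, "a": 0.1, "b": 0.9, "c": 0.2, "d": 0.8}

def breadth_first_order(graph, seed):
    """Visit pages in FIFO order, as in standard crawling."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def best_first_order(graph, seed, score):
    """Always visit the highest-scoring known page next, as in focused crawling."""
    seen, order = {seed}, []
    heap = [(-score(seed), seed)]  # negate scores: heapq is a min-heap
    while heap:
        _, page = heapq.heappop(heap)
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (-score(link), link))
    return order

print(breadth_first_order(GRAPH, "seed"))            # ['seed', 'a', 'b', 'c', 'd']
print(best_first_order(GRAPH, "seed", SCORES.get))   # ['seed', 'b', 'd', 'a', 'c']
```

Both functions only look at pages already discovered, which is what makes them local-search strategies: the only difference is the order in which the frontier is drained.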

6 Focused crawling
– Learning phase
– Crawling phase

7 Related work – naïve Bayes crawler
– One of the simplest focused crawlers
– Extracted text is represented as a vector of words weighted by word frequency
– The relevance score is the cosine similarity between page p and the query q representing the topic
– Focuses only on target pages and assigns low priority to source links
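The relevance score described on this slide can be sketched as follows. This is a minimal illustration, assuming a simple lowercase alphanumeric tokenizer and raw term counts as weights; the function names `term_vector` and `cosine_similarity` are hypothetical, and the slide does not specify the tokenization or any further weighting.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Represent text as a bag of words weighted by term frequency."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(p, q):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(p[term] * q[term] for term in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # an empty page or query has no defined direction
    return dot / (norm_p * norm_q)

page = term_vector("Publication list, publication!")
query = term_vector("publication list")
relevance = cosine_similarity(page, query)
```

A score near 1 means the page uses the query's words in similar proportions; a score of 0 means the two share no words at all.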

8 Related work – Context focused crawler
– Represents the context in which the target pages are found as a graph
– A page in layer (i) has a direct link to some page in layer (i-1)
– Layer (0) contains the target page
– N classifiers, one for each layer
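The layered context graph can be sketched as below. This is only an illustration under stated assumptions: the `backlinks` mapping (page -> pages that link to it) is assumed to be available from somewhere, each page is placed in the lowest layer it qualifies for, and the per-layer classifiers are not implemented. The name `build_context_layers` is hypothetical.

```python
def build_context_layers(backlinks, target, depth):
    """Return [layer0, layer1, ...] where layer i links directly into layer i-1."""
    layers = [{target}]      # layer (0) contains the target page
    assigned = {target}      # pages already placed in some layer
    for _ in range(depth):
        next_layer = set()
        for page in layers[-1]:
            for source in backlinks.get(page, ()):
                # Assumption: keep each page only in its lowest layer.
                if source not in assigned:
                    next_layer.add(source)
        assigned |= next_layer
        layers.append(next_layer)
    return layers

# Hypothetical backlink data: page -> pages that link to it.
BACKLINKS = {
    "paper": ["pubs", "home"],
    "pubs": ["home", "dept"],
    "home": ["dept"],
}
layers = build_context_layers(BACKLINKS, "paper", depth=2)
# layers == [{"paper"}, {"pubs", "home"}, {"dept"}]
```

With the layers in hand, one classifier per layer could be trained on the pages of that layer, so that a newly fetched page can be assigned an estimated link distance from a target.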

9 What is a publication spider?
– Gathers publication pages
– Uses focused crawling
– With the help of a search engine

10 Related work – PaSE (Page Search Engine)
Given citation information, find the online PDF document:
– Top 10 links returned from Google --> the right page, i.e. one likely to contain the online PDF (breadth-first, depth-first, random)
– Right web page --> identify the right (citation, PDF) pair, using (title, PDF): pick the PDF link with the shortest distance to the citation block
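The last step, picking the PDF link closest to the citation block, might look like the sketch below. The slide does not say how distance is measured; here it is assumed to be the character offset in the raw HTML between the citation title and the link, and `nearest_pdf_link` is a hypothetical name, not PaSE's actual implementation.

```python
import re

# Assumed pattern for PDF links: a double-quoted href ending in ".pdf".
PDF_LINK = re.compile(r'href="([^"]+\.pdf)"', re.IGNORECASE)

def nearest_pdf_link(html, title):
    """Return the PDF href closest (in characters) to the citation title."""
    position = html.lower().find(title.lower())
    if position < 0:
        return None  # the citation block is not on this page
    best = None
    for match in PDF_LINK.finditer(html):
        distance = abs(match.start() - position)
        if best is None or distance < best[0]:
            best = (distance, match.group(1))
    return best[1] if best else None

# Hypothetical publication page with two entries.
HTML = (
    '<li>Focused Crawling Basics <a href="a.pdf">pdf</a></li>'
    '<li>Another Paper Title <a href="b.pdf">pdf</a></li>'
)
print(nearest_pdf_link(HTML, "Another Paper Title"))  # b.pdf
```

A distance measured on the parsed document tree would likely be more robust than raw character offsets, but the offset version keeps the idea visible in a few lines.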

11 General framework for spider
Diagram labels: repository, page fetch unit, URL filter, URL extractor, frontier, classifier, feature extractor, target repository, googleapi, keywords, target pages.
Highly dependent on the seed pages.
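One way the components named on this slide could fit together is sketched below. It is a guess at the wiring, not the author's implementation: the `crawl` function, its parameters, the 0.5 threshold, and the toy `PAGES` data are all hypothetical, the seeds are assumed to come from a search-engine query made elsewhere, and page fetching is injected as a function so the example runs offline.

```python
import heapq
import re

LINK = re.compile(r'href="([^"]+)"')  # assumed URL extractor: double-quoted hrefs

def crawl(seeds, fetch, classify, url_filter, max_pages=100):
    """Best-first crawl; returns URLs the classifier accepts as target pages."""
    frontier = [(-1.0, url) for url in seeds]  # seeds get top priority
    heapq.heapify(frontier)
    seen = set(seeds)
    repository = []                            # target repository
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        html = fetch(url)                      # page fetch unit
        fetched += 1
        if html is None:
            continue
        score = classify(html)                 # feature extractor + classifier
        if score >= 0.5:                       # assumed acceptance threshold
            repository.append(url)
        for link in LINK.findall(html):        # URL extractor
            if link not in seen and url_filter(link):  # URL filter
                seen.add(link)
                # Links inherit the score of the page they were found on.
                heapq.heappush(frontier, (-score, link))
    return repository

# Hypothetical offline "web" for demonstration.
PAGES = {
    "seed": '<a href="pubs">p</a><a href="news">n</a><a href="x.jpg">i</a>',
    "pubs": 'publications <a href="paper1">1</a>',
    "news": "news",
    "paper1": "publications list",
}
found = crawl(
    ["seed"],
    fetch=PAGES.get,
    classify=lambda html: 1.0 if "publications" in html else 0.0,
    url_filter=lambda url: not url.endswith(".jpg"),
)
# found == ["pubs", "paper1"]
```

Because every priority in the frontier derives from pages reachable from the seeds, a poor seed set leaves the crawler with nothing relevant to expand, which matches the slide's remark about dependence on the seed pages.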

12 Diagram labels: Target Repository, Search Engine, More Pages, PublicationEntry, KeyWord

13

14 Future work
– Scale up the evaluation
– Improve the performance of the spider

