Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.

1 Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei

2 1. J. Qin et al. Building Domain Specific Web Collections for Scientific Digital Libraries: A Meta-Search Enhanced Focused Crawling Method 2. G. Pant et al. Panorama: Extending Digital Libraries with Topical Crawlers (JCDL 2004)

3 Outline Problem Description Research Background Their Approaches  Designing Classifier  Enhancing Meta-Search  Identifying Communities in Collection Experiments Discussion

4 Problem Description Problem: Collect domain-specific documents from the Web and manage the resulting literature collection. Topical crawlers (focused crawlers) are designed to collect domain-specific documents. Open question: how to bridge the gap between Web communities.

5 Digital Library vs. Search Engine Digital Library  Domain specific, serving for literature study  High Quality  Topical Crawler + Collection management  Knowledge discovery Search Engine  General, serving for web search  High Quantity  General Crawler + Online retrieval  Indexing, retrieving performance, etc.

6 Research Background Domain Definition VSM, Naïve Bayes, SVM, etc. Expand training set, get starting URLs BFS, best-first search, tree pruning, multiple starting URLs, tunneling TF-IDF, K-Means, etc. PageRank, HITS, etc.

7 Why doesn't a general crawler with breadth-first search work?
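One way to make the question concrete is a toy comparison under a fixed fetch budget. The sketch below is an illustration only, not the crawler from either paper: the graph `GRAPH`, the relevance scores `SCORE`, and the rule that a link inherits the score of the page it was found on are all assumptions made for the example.

```python
import heapq
from collections import deque

# Hypothetical toy web graph: page -> outgoing links.
GRAPH = {
    "seed": ["a", "b", "c"],
    "a": ["a1", "a2"], "b": ["b1", "b2"], "c": ["c1", "c2"],
    "a1": [], "a2": [], "b1": [], "b2": [], "c1": [], "c2": [],
}
# Hypothetical topical relevance of each page, in [0, 1].
SCORE = {"seed": 1.0, "a": 0.1, "b": 0.9, "c": 0.2,
         "a1": 0.1, "a2": 0.0, "b1": 0.8, "b2": 0.9, "c1": 0.1, "c2": 0.0}


def bfs_crawl(start, budget):
    """Visit pages in discovery order until the fetch budget is spent."""
    visited, seen, queue = [], {start}, deque([start])
    while queue and len(visited) < budget:
        page = queue.popleft()
        visited.append(page)
        for link in GRAPH[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


def best_first_crawl(start, budget):
    """Visit the link whose parent page scored highest first."""
    visited, seen = [], {start}
    # Heap entries: (negated priority, insertion counter, page).
    heap, counter = [(-1.0, 0, start)], 1
    while heap and len(visited) < budget:
        _, _, page = heapq.heappop(heap)
        visited.append(page)
        for link in GRAPH[page]:
            if link not in seen:
                seen.add(link)
                # Assumption: a link inherits the score of its parent page.
                heapq.heappush(heap, (-SCORE[page], counter, link))
                counter += 1
    return visited


def relevant_count(pages, threshold=0.5):
    return sum(1 for p in pages if SCORE[p] >= threshold)
```

With a budget of six fetches on this toy graph, breadth-first search spends its last two fetches under the irrelevant page `a`, while the best-first frontier spends them under the relevant page `b`.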

8 Web Communities

9 Design Classifier (Pant et al.) Motivation: define the domain and distinguish relevant from non-relevant documents. Approach:  Query Google with title & references to construct positive/negative example sets (the training set)  Use the Vector Space Model to represent documents, with TF-IDF as term weights  Use a Naïve Bayes classifier to estimate Pr(c+ | q), which is used for ranking
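The classifier step can be sketched as a small multinomial Naïve Bayes model that returns the posterior of the positive class. This is a minimal illustration, not the Pant et al. implementation: it uses raw term counts with Laplace smoothing rather than TF-IDF weights, and the labels `"pos"` and `"neg"` and the function names are assumptions.

```python
import math
from collections import Counter


def train(docs_by_class):
    """docs_by_class maps a label to a list of tokenized documents."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d)
        total = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            # Laplace (add-one) smoothing over the shared vocabulary.
            "log_like": {t: math.log((counts[t] + 1) / (total + len(vocab)))
                         for t in vocab},
        }
    return model


def posterior(model, tokens, label="pos"):
    """Return Pr(label | tokens); out-of-vocabulary tokens are ignored."""
    log_scores = {}
    for name, params in model.items():
        score = params["log_prior"]
        for t in tokens:
            if t in params["log_like"]:
                score += params["log_like"][t]
        log_scores[name] = score
    # Subtract the max before exponentiating for numerical stability.
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp_scores[label] / sum(exp_scores.values())


training = {
    "pos": [["topical", "crawler", "web"],
            ["focused", "crawler", "digital", "library"]],
    "neg": [["pasta", "recipe", "sauce"], ["recipe", "oven", "bread"]],
}
model = train(training)
```

A crawler could then rank a fetched page by `posterior(model, page_tokens)` and use that value as the priority of the links found on the page.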

10 Design Classifier (cont.) TF-IDF weighting:
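The formula from this slide did not survive in the transcript. As a reference point only, the sketch below computes one common TF-IDF variant (term frequency normalized by document length, times the log of inverse document frequency); the variant used on the original slide may differ, and the function name `tf_idf` is an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({t: (count / len(d)) * math.log(n_docs / df[t])
                        for t, count in tf.items()})
    return vectors


docs = [["web", "crawler", "crawler"], ["web", "library"]]
vectors = tf_idf(docs)
```

In this variant a term that occurs in every document, such as `web` here, gets weight zero, so only discriminating terms contribute to the document vector.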

11 Enhancing Meta-Search (Qin et al.) Motivation: overcome the limitation of local search algorithms in crawling and bridge distributed Web communities. Approach:  Manually provide domain-specific queries  Query a meta-search engine to get multiple starting URLs.
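The seed-gathering step can be sketched as merging ranked result lists from several engines into one deduplicated list of starting URLs. This is an assumption-laden illustration, not the procedure from Qin et al.: the round-robin interleaving, the URL normalization rules, and the names `normalize` and `merge_seed_urls` are all hypothetical, and the result lists are supplied by the caller rather than fetched from any real search API.

```python
from itertools import zip_longest
from urllib.parse import urlsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment and a trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def merge_seed_urls(result_lists, limit):
    """Interleave ranked lists by rank and keep the first copy of each URL."""
    seeds, seen = [], set()
    for rank_group in zip_longest(*result_lists):
        for url in rank_group:
            if url is None:  # this engine returned fewer results
                continue
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                seeds.append(url)
                if len(seeds) == limit:
                    return seeds
    return seeds


# Hypothetical results for one domain query from two engines.
engine_a = ["http://Example.org/nano/", "http://a.example/x"]
engine_b = ["http://example.org/nano", "http://b.example/y"]
seeds = merge_seed_urls([engine_a, engine_b], limit=10)
```

Interleaving by rank keeps the top hit of every engine near the front, so seeds drawn from different engines, and possibly different Web communities, enter the crawl frontier early.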

12 Identifying Communities in Collection (Pant et al.) Motivation: analyze the latent structures in the collection; summarize and represent potential communities. Approach:  Use k-means for content clustering  Use HITS for structural clustering  Label clusters by TF-IDF filtering
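For the structural step, the core of HITS is an iterative update of hub and authority scores over the link graph. The sketch below shows only that update on a hypothetical four-page graph; how Pant et al. turn the scores into clusters is not described in the transcript, so that part is left out.

```python
import math


def hits(links, iterations=50):
    """links maps a page to the pages it points to; returns (hub, auth)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score of a page: sum of authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical link graph: two hub-like pages pointing at two targets.
links = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
hub, auth = hits(links)
```

On this graph `a` ends up as the strongest authority because both `h1` and `h2` link to it, and `h1` as the strongest hub because it links to both targets.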

13 Experiments (Qin et al.) Experiment Design  Compare with Google and a domain-specific search engine. 996028 pages, 1/3 from the meta-search method. Metric: precision at 20 (pre@20).  Compare meta-search enhanced crawling with traditional crawling in terms of precision. 997632 pages from the baseline method. Metric: precision at 10 (pre@10).  Experts define the queries and judge the results.
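Reading pre@20 and pre@10 as precision at rank k, the metric is the fraction of the top k results that the judges marked relevant. The sketch below assumes that reading; the function name and the sample data are hypothetical and are not results from the paper.

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k ranked results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_results[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    # Dividing by k (not len(top_k)) penalizes result lists shorter than k.
    return hits / k


# Hypothetical ranking and expert judgments for one query.
ranking = ["d1", "d2", "d3", "d4"]
judged_relevant = {"d1", "d3"}
```

For this sample, `precision_at_k(ranking, judged_relevant, 2)` is 0.5, since one of the top two results is relevant.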

14 Experiments (Qin et al.) cont. Experiment Result 1:  Their approach:  General SE:  Domain Specific:  Meta-search enhanced method better than general search engine and traditional domain specific search engine. Experiment Result 2  Expert ranking results in range 1-4  Meta-search Enhanced: 2.77  Baseline: 2.51  In Top 100 results from Meta-search Enhanced collection: from meta-search: 3.22, rest: 2.61

15 Experiments (Pant et al.) Experiments Design:  Test Bed: from CiteSeer. 94 papers as initial documents. Use one (expanded by querying Google) for building positive example set, 93 for building negative example set. Compare with a BFS crawler. Harvest rate:

16 Experiments (Pant et al.) cont. Experiment results: InterWeave: A middleware system for distributed shared states

17 Conclusions System overview for building a literature collection with topical Web crawlers. Classifier-enhanced best-first search performs better than breadth-first search. A meta-search enhanced topical crawler performs better than topical crawlers without meta-search. A clustering-based method represents latent community structures in the collection.

18 Discussion Contributions of these two papers  Qin et al.: enhance crawling with meta-search to get multiple starting URLs  Pant et al.: clarify and implement a sound system structure; propose a way to discover latent communities in the collection Limitations  no significant theoretical contribution  experiments not convincing

19 Thanks!

