6/16/2015. Recent Results in Automatic Web Resource Discovery. Soumen Chakrabarti. Presentation by Cui Tao.

1 Recent Results in Automatic Web Resource Discovery. Soumen Chakrabarti. Presentation by Cui Tao.

2 Introduction. Classical IR: indexing a collection of documents; answering queries by returning a ranked list of relevant documents. Problems in retrieving online documents: ambiguity, context sensitivity, synonymy, polysemy, and the large number of relevant Web pages.

3 Introduction. Directory-based topic browsing: tree-like structure, mostly maintained by human experts. Advantages: exemplary, influential. Disadvantages: slow, subjective, and noisy.

4 Introduction. Standard crawlers and search engines: in 1997 they covered 35-40% of 340 million Web pages; in 1999 they covered 18% of 800 million Web pages. They cannot be used for maintaining generic portals or for automatic resource discovery.
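
The coverage percentages are easier to compare as absolute page counts. The following is a small arithmetic check using only the figures quoted on the slide; the function and variable names are illustrative.

```python
# Figures quoted on the slide:
# (year, coverage range in percent, Web size in millions of pages)
surveys = [
    (1997, (35, 40), 340),
    (1999, (18, 18), 800),
]


def covered_pages(percent, web_size_millions):
    """Absolute number of covered pages, in millions (integer arithmetic)."""
    return percent * web_size_millions // 100


for year, (low, high), size in surveys:
    print(year, covered_pages(low, size), covered_pages(high, size))
```

This works out to roughly 119 to 136 million covered pages in 1997 against 144 million in 1999, so absolute coverage barely grew while the Web more than doubled in size.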

5 Introduction. Focused crawler: selectively seeks out pages that are relevant to a pre-defined set of topics; preferred by experts and researchers. Two modules: Classifier: analyzes the text in, and the links around, a given Web page and automatically assigns it to suitable directories in a Web catalog. Distiller: identifies the centrality of crawled pages to determine visit priorities.
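
The classifier-driven part of a focused crawl can be sketched as a best-first traversal. This is a minimal sketch under stated assumptions, not the system described in the talk: `fetch` and `relevance` are hypothetical callables supplied by the caller, each discovered link simply inherits the relevance score of the page it was found on, and the distiller is left out.

```python
import heapq


def focused_crawl(seeds, fetch, relevance, budget=100):
    """Best-first crawl that visits links from relevant pages first.

    fetch(url) -> (text, out_links)   # hypothetical page fetcher
    relevance(text) -> float in [0, 1]  # hypothetical topic classifier
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, out_links = fetch(url)
        score = relevance(text)
        visited.append((url, score))
        for link in out_links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link is as promising as the page citing it.
                heapq.heappush(frontier, (-score, link))
    return visited
```

With a fixed page budget, links found on irrelevant pages sink to the bottom of the queue and are fetched last or not at all, which is what makes the crawl selective.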

6 Distillation techniques. Google: simulates a random walk on the Web; pages are ranked by pre-computed popularity and visitation rate; fast.
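
The visitation rate of a random walk is commonly computed by power iteration, as in PageRank. The sketch below assumes a damping factor of 0.85, a fixed iteration count, and an in-memory link dictionary; none of these details come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank
```

Because the scores depend only on the link graph, they can be computed once offline and reused for every query, which is consistent with the "pre-computed" and "fast" remarks on the slide.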

7 Distillation techniques. HITS (Hyperlink-Induced Topic Search): depends on a search engine. Combines two scores: Authorities: pages with useful information about a topic. Hubs: pages that contain many links to pages with useful information on the topic. Query dependent and slow. May lead to topic contamination or drift.
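
The mutual reinforcement between hubs and authorities can be sketched as an iterative update on a small link graph. This is an illustrative sketch: the iteration count, the L2 normalization, and the dictionary-based graph representation are assumptions, and the step of building the graph from search engine results is omitted.

```python
import math


def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to.

    Returns (hub, auth) score dictionaries.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

Since the graph is rebuilt and iterated for each query, the computation happens at query time, which matches the "query dependent and slow" remark on the slide.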

8 Distillation techniques. ARC and CLEVER: ARC (Automatic Resource Compiler) is part of CLEVER. The root set is expanded by 2 links instead of 1 link (including all pages at link distance two or less from at least one page in the root set). Weights are assigned to hyperlinks based on the match between the query and the text surrounding the hyperlink in the source document.
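
Both ideas on this slide can be sketched in a few lines. The sketch makes two assumptions that the slide does not state: link distance is measured ignoring link direction, and an edge weight is one plus the number of query terms found in the text around the hyperlink. The function names are illustrative.

```python
import re


def expand_root_set(root, out_links, max_distance=2):
    """Return all pages within max_distance links of any root page.

    Assumption: links are followed in both directions.
    """
    neighbors = {}
    for src, targets in out_links.items():
        for t in targets:
            neighbors.setdefault(src, set()).add(t)
            neighbors.setdefault(t, set()).add(src)
    seen = set(root)
    frontier = set(root)
    for _ in range(max_distance):
        frontier = {n for p in frontier for n in neighbors.get(p, ())} - seen
        seen |= frontier
    return seen


def anchor_weight(query, anchor_window):
    """Edge weight: 1 + number of query terms in the text near the link."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    window_terms = set(re.findall(r"[a-z0-9]+", anchor_window.lower()))
    return 1 + len(query_terms & window_terms)
```

A link whose surrounding text mentions the query then counts for more in the hub and authority sums than a link with generic text such as "click here".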

9 Distillation techniques. Outlier filtering: computes relevance weights for pages using the vector space model; all pages whose weights are below a threshold are pruned. This effectively prunes away outlier nodes in the neighborhood and thus avoids contamination.
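
A minimal sketch of threshold-based pruning in the vector space model follows. It assumes raw term-frequency vectors, cosine similarity against a reference text describing the topic, and a threshold of 0.1; the slide does not specify the weighting scheme, the reference vector, or the threshold.

```python
import math
import re
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector of a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def prune_outliers(pages, topic_text, threshold=0.1):
    """Keep only pages whose similarity to the topic reaches the threshold.

    pages: dict mapping URL to page text.
    """
    topic = tf_vector(topic_text)
    return {
        url: text
        for url, text in pages.items()
        if cosine(tf_vector(text), topic) >= threshold
    }
```

Pruned pages never enter the graph handed to the distillation step, so they cannot pull the hub and authority scores toward an unrelated topic.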

10 Topic distillation vs. resource discovery. Topic distillation: depends on large, comprehensive Web crawls and indices (post-processing). Can it be used to generate a Web taxonomy? Set a keyword query for each node in the taxonomy, then run a distillation program. Simple, but has some problems.

11 Topic distillation vs. resource discovery. Query for the Yahoo! node /Business&Economy/Companies/Electronics/PowerSupplies: +"power suppl*" "switch* mode" smps -multiprocessor* "uninterrupt* power suppl*" ups -parcel. To match the directory-based browsing quality of Yahoo!: 7.03 terms and 4.34 operators; Alta Vista: 2.35 terms and 0.41 operators. Problems: constructing the query involves trial, error, and complicated thought. Query: "North American telecommunication companies".

12 Topic distillation vs. resource discovery. Problems: contamination; stop-sites: not automatic; term weighting and edge weighting: no precise algorithm to set the weights. Topic distillation by itself is not enough for resource discovery.

13 Hypertext classification: learning from examples. Adding example pages and their distance-1 neighbors to the graph to be distilled improves the result. The contents of the given examples and their neighbors provide a way to compute the decision boundary for classification. Classifiers: NN, Bayesian, and support vector classifiers.
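
To illustrate learning a topic classifier from example pages, here is a minimal multinomial naive Bayes sketch with add-one smoothing. It is one simple instance of the Bayesian family named on the slide, not the classifier used in the work being presented; the tokenizer and the function names are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def train_naive_bayes(examples):
    """examples: list of (text, topic) pairs. Returns the model tuple."""
    doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, topic in examples:
        doc_counts[topic] += 1
        for term in tokenize(text):
            term_counts[topic][term] += 1
            vocab.add(term)
    return doc_counts, term_counts, vocab


def classify(model, text):
    """Return the topic with the highest log posterior for the text."""
    doc_counts, term_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        score = math.log(doc_counts[topic] / total_docs)
        total_terms = sum(term_counts[topic].values())
        for term in tokenize(text):
            if term in vocab:  # terms never seen in training are ignored
                count = term_counts[topic][term] + 1  # add-one smoothing
                score += math.log(count / (total_terms + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic
```

The example pages play the role that a hand-built keyword query plays in topic distillation: they define the topic, but by demonstration rather than by query construction.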

14 Hypertext classification. Link-based features are important. Circular topic influence: the topic of a page influences its text and the topics of its neighboring pages. Knowledge of the linked vicinity's topics provides clues to the test document's topic. Bibliometric; more general than the simple linear endorsement model used in topic distillation.
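
The circular influence between a page's topic and its neighbors' topics can be illustrated with an iterative update. The sketch below is a deliberately simplified stand-in: it linearly mixes each page's text-based topic probabilities with the average of its neighbors' current estimates. The mixing weight, the round count, and the linear rule itself are assumptions and not the model from the talk.

```python
def relax_labels(text_probs, neighbors, rounds=10, link_weight=0.5):
    """Refine per-page topic probabilities using linked pages.

    text_probs: page -> {topic: probability from the page text alone}
    neighbors:  page -> list of linked pages
    """
    probs = {page: dict(dist) for page, dist in text_probs.items()}
    for _ in range(rounds):
        updated = {}
        for page, own in text_probs.items():
            linked = [probs[n] for n in neighbors.get(page, []) if n in probs]
            mixed = {}
            for topic, p_text in own.items():
                if linked:
                    p_nbr = sum(d.get(topic, 0.0) for d in linked) / len(linked)
                    mixed[topic] = (1 - link_weight) * p_text + link_weight * p_nbr
                else:
                    mixed[topic] = p_text
            total = sum(mixed.values()) or 1.0
            updated[page] = {t: v / total for t, v in mixed.items()}
        probs = updated
    return probs
```

A page whose own text is ambiguous can then be assigned the topic its neighborhood points to, which is the kind of clue the slide refers to.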

15 Putting it together for resource discovery

16 Conclusion. Emphasized the importance of scalable automatic resource discovery. Argued that common search engines are not adequate for resource discovery. Introduced the recently invented focused crawling system.

17 Future work. How can the training examples be derived automatically? How can the output of a focused crawler be personalized for users?

