The Structure of Broad Topics on the Web
Soumen Chakrabarti, Mukul M. Joshi, et al.
Presentation by Na Dai

Introduction & Contribution
  Convergence of topic distribution on undirected random walks
  Degree distribution restricted to topics
  How topic-biased are breadth-first crawls?
  Representation of topics in Web directories
  Topic convergence on directed walks
  Link-based vs. content-based Web communities

Building Blocks
  Sampling Web pages
    – PageRank-based random walk: Wander walk
    – The Bar-Yossef random walk: Sampling walk (undirected, regular graph)
  Taxonomy design & Document classification
    – 271,954 topics, 6 levels, 1,697,266 sample URLs
    – Pruned taxonomy: 482 leaf nodes, 144,859 sample URLs
    – Classification: Rainbow naïve Bayes classifier
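
The Sampling walk can be illustrated with a minimal sketch. It assumes the usual reading of "undirected, regular": links are treated as undirected, and every node is padded with implicit self-loops up to a common degree so that the walk's stationary distribution is uniform. The toy graph, the function name `sampling_walk`, and the parameter `max_degree` are hypothetical and not taken from the slides.

```python
import random
from collections import Counter

def sampling_walk(neighbors, start, steps, max_degree, rng):
    """Random walk on an undirected graph made regular by implicit self-loops."""
    node = start
    for _ in range(steps):
        nbrs = neighbors[node]
        # Follow a real edge with probability deg(node) / max_degree;
        # otherwise take one of the implicit self-loops and stay put.
        if rng.random() < len(nbrs) / max_degree:
            node = rng.choice(nbrs)
    return node

# Hypothetical toy graph: one hub linked to three leaves (undirected).
graph = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
max_deg = max(len(v) for v in graph.values())

rng = random.Random(0)
samples = Counter(
    sampling_walk(graph, "hub", 50, max_deg, rng) for _ in range(4000)
)
```

Without the self-loops, the hub would be visited far more often than any leaf. With them, each of the four nodes should receive roughly a quarter of the samples.
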
Convergence
  Sampling method
    – Sampling walk
  Topic distribution of a set
    – Soft counting
  Difference measure
    – L1 distance
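
Soft counting and the L1 distance can be sketched as follows. The sketch assumes each page comes with a vector of class probabilities from the classifier. The sample vectors and the names `soft_count` and `l1_distance` are hypothetical.

```python
def soft_count(class_probs):
    """Average per-page class probability vectors into one topic distribution."""
    n = len(class_probs)
    k = len(class_probs[0])
    # Each page adds its probability for class c rather than a hard 0/1 vote.
    return [sum(p[c] for p in class_probs) / n for c in range(k)]

def l1_distance(p, q):
    """Sum of absolute differences between two topic distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical classifier outputs for two small page samples, three topics.
sample_a = [[0.8, 0.2, 0.0], [0.6, 0.3, 0.1]]
sample_b = [[0.1, 0.7, 0.2], [0.3, 0.5, 0.2]]

dist_a = soft_count(sample_a)
dist_b = soft_count(sample_b)
gap = l1_distance(dist_a, dist_b)
```

The L1 distance between two probability distributions lies between 0 (identical) and 2 (disjoint support), so a value that shrinks as the walk lengthens indicates convergence.
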
The background distribution vs. breadth-first crawls
Faithful representation of topics in Web directory
Topic-specific degree distributions
  Power law distribution
    – $\Pr(i) = k / i^{x}$, $x > 1$
  Contribution to class c
    – Soft counting
    – A page d with degree $\Delta_d$ contributes $p_c(d)$ to class c at that degree
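
A minimal sketch of a soft-counted, per-topic degree histogram, followed by a crude estimate of the power-law exponent x. The least-squares fit on log-log points is an assumption made for illustration, not necessarily the fitting method used in the paper. The data and the function names are hypothetical.

```python
import math
from collections import defaultdict

def topic_degree_histogram(pages, num_classes):
    """pages: list of (degree, class_probs). Returns hist[c][degree] = soft count."""
    hist = [defaultdict(float) for _ in range(num_classes)]
    for degree, probs in pages:
        for c in range(num_classes):
            # A page adds its class probability to the bucket for its degree.
            hist[c][degree] += probs[c]
    return hist

def power_law_exponent(degree_counts):
    """Estimate x in count ~ k / degree^x from a least-squares log-log slope."""
    pts = [
        (math.log(d), math.log(n))
        for d, n in degree_counts.items()
        if d > 0 and n > 0
    ]
    mean_x = sum(a for a, _ in pts) / len(pts)
    mean_y = sum(b for _, b in pts) / len(pts)
    num = sum((a - mean_x) * (b - mean_y) for a, b in pts)
    den = sum((a - mean_x) ** 2 for a, _ in pts)
    return -num / den

# Synthetic pages: class 0 mass falls off as 1 / degree^2 by construction.
pages = [(d, [1 / d**2, 1 - 1 / d**2]) for d in range(1, 11)]
hist = topic_degree_histogram(pages, 2)
x_hat = power_law_exponent(hist[0])
```

On this synthetic input the estimate recovers an exponent of about 2. On real crawl data, the fit would be noisy in the high-degree tail.
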
Topical locality and link-based prestige ranking
  Sampling method
    – Wander walk
  Class selection
    – Dmoz, well-populated
  Collect all the pages at distance i (i > 0)
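
The collection step can be sketched as below, under a simplifying assumption: a plain forward walk over out-links starting from seed pages of one class, without the random jumps of the full Wander walk. The link graph, the seeds, and the name `pages_by_hop` are hypothetical.

```python
import random
from collections import defaultdict

def pages_by_hop(out_links, seeds, max_hops, rng):
    """Walk forward from each seed and bucket the visited pages by hop count."""
    by_hop = defaultdict(list)
    for seed in seeds:
        page = seed
        for hop in range(1, max_hops + 1):
            targets = out_links.get(page, [])
            if not targets:
                break  # dead end: this walk stops early
            page = rng.choice(targets)
            by_hop[hop].append(page)
    return by_hop

# Hypothetical chain a -> b -> c -> d, with two seed pages.
out_links = {"a": ["b"], "b": ["c"], "c": ["d"]}
buckets = pages_by_hop(out_links, ["a", "b"], 2, random.Random(0))
```

The pages in each bucket could then be classified and soft-counted to see how the topic distribution drifts with distance i from the starting class.
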
Relations between topics
  Topic citation matrix
  Contribution to topic citation matrix C
    – $C \leftarrow C + p(u)^{T} p(v)$ for each link (u, v)
  Implications and application
    – Improved hypertext classification
    – Enhanced focused crawling
    – Reorganizing topic directories
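
A minimal sketch of building the topic citation matrix, assuming p(u) is a row vector of class probabilities for page u, so that each link (u, v) adds the outer product of p(u) and p(v) to C. The pages, the links, and the function name are hypothetical.

```python
def topic_citation_matrix(links, class_probs, num_classes):
    """Accumulate the outer product of p(u) and p(v) over all links (u, v)."""
    C = [[0.0] * num_classes for _ in range(num_classes)]
    for u, v in links:
        pu, pv = class_probs[u], class_probs[v]
        for i in range(num_classes):
            for j in range(num_classes):
                # Soft count of "a page of topic i cites a page of topic j".
                C[i][j] += pu[i] * pv[j]
    return C

# Hypothetical class probabilities for three pages and two topics.
class_probs = {"u1": [1.0, 0.0], "u2": [0.5, 0.5], "u3": [0.0, 1.0]}
links = [("u1", "u2"), ("u2", "u3"), ("u1", "u3")]
C = topic_citation_matrix(links, class_probs, 2)
```

Because each probability vector sums to 1, every link contributes a total mass of 1, so the entries of C sum to the number of links. Row i, once normalized, describes which topics pages of topic i tend to cite.
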
Concluding remarks
  Characterize some important notions of topical locality on the Web
  Open problems
    – PageRank jump parameter
    – Topical stability of distillation algorithms
    – Better crawling algorithms

Q & A?