Published by Andra Smith. Modified over 8 years ago.
2
Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering
Indian Institute of Technology Bombay
www.cse.iitb.ac.in/~soumen
3
WWW2001
Topic distillation
- Crisis of abundance
  - Short queries, many relevant pages
  - Popularity must complement relevance
- Link-based quality ranking
  - PageRank/Google
  - HITS, Clever, topic distillation
- Limitation
  - Page = node, link = edge, node has scores
  - Classical notion of document boundary too simple to capture complex Web idioms
4
Hyperlink Induced Topic Search
- Query → keyword search engine → root set → expanded graph
- $a = E^{T} h$
- $h = E a$
- ‘Hubs’ and ‘authorities’
(Figure: expanded graph around the root set, with nodes labeled h for hubs and a for authorities)
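The two update rules on this slide can be sketched as a small power iteration. This is a minimal illustration, assuming E is the link adjacency matrix stored as a list of (source, target) pairs; the function name `hits`, the L2 normalization, and the toy graph are illustrative choices, not taken from the slides.

```python
import math

def hits(edges, iterations=50):
    """Run HITS power iteration on a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a = E^T h: authority of a page is the sum of hub scores pointing to it
        auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            auth[v] += hub[u]
        # h = E a: hub score of a page is the sum of authority scores it points to
        hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            hub[u] += auth[v]
        # Normalize both vectors to unit L2 length so scores stay bounded
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Toy graph: h1 and h2 cite both a1 and a2, h3 cites only a2
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
print(sorted(auth.items(), key=lambda kv: -kv[1])[:2])
```

In this toy graph a2 ends up with the highest authority score because it is cited by one more hub than a1.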
5
Tightly-knit pseudo-communities
- Links relevant to “amusement parks” on www.411fun.com
- Multi-site nepotistic links generated from a template form a pseudo-community
- Low-popularity content on “affirmative action”
6
Mixed hubs with ‘hot’ and ‘cold’ links
- This section specializes in ‘Shakespeare’
- Remaining sections generalize and/or drift
7
Challenge
- Evolution of Web content since 1996
  - Large, complex, dynamic pages generated as views from databases
  - Rampant multi-host ‘nepotism’: web-rings and banner exchanges
- Linear model for conferral of authority less valid
- File or page boundary less meaningful
- Deteriorating results of topic distillation
8
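Document object model (DOM)
- Hierarchical graph model for semi-structured data
- Tag-tree for HTML
- Can extract reasonable DOM from HTML
- Fine-grained view of the Web: trees with ‘macro’-links from leaves to roots
(Figure: tag tree with html at the root, children head and body, title under head, ul under body with li and a elements; the anchors Portals, Yahoo and Lycos appear as example content)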
9
Why not run HITS on DOM graph?
- Bipartite cores central to the success of HITS
- Authority diffusion easily blocked by editorial idiosyncrasy
- Need a better model
(Figure: DOM tree with numbered nodes)
10
Other unsuccessful ideas
- Resistive networks
- Flow-based formulations
(Figure: example with numbered nodes)
11
A model for hub (score) generation
- Global hub score distribution $\Theta_0$ w.r.t. given query
- Authors use DOM nodes to specialize $\Theta_0$ into local $\Theta_I$
- At a certain frontier in the DOM tree, the local distribution directly generates hub scores in ‘hot’ and ‘cold’ subtrees
(Figure: global distribution at the top, progressive ‘distortion’ down the DOM tree to the model frontier; other pages shown alongside)
12
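A balanced cost measure
- Reference distribution $\Theta_0$
- Cumulative distortion cost $= \mathrm{KL}(\Theta_0; \Theta_u) + \dots + \mathrm{KL}(\Theta_u; \Theta_v)$
- Data encoding cost is roughly [formula missing] (for exponential distribution)
- Goal: find minimum-cost frontier
(Figure: path from the reference distribution through node u to node v, with hub scores $H_v$ in the subtree of v)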
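As a hedged illustration of the two cost terms, suppose the hub scores under a node are modeled as exponentially distributed with a rate parameter. The rate parametrization and the symbols $\lambda_u$, $\lambda_v$ are assumptions made here, not taken from the slide. The KL term between two such models then has a closed form, and one standard choice of data encoding cost is the negative log-likelihood of the scores; the approximation on the original slide may differ.

```latex
\mathrm{KL}(\lambda_u \,\|\, \lambda_v)
  = \int_0^\infty \lambda_u e^{-\lambda_u x}\,
    \log\frac{\lambda_u e^{-\lambda_u x}}{\lambda_v e^{-\lambda_v x}}\,dx
  = \log\frac{\lambda_u}{\lambda_v} + \frac{\lambda_v}{\lambda_u} - 1

-\sum_{h \in H_v} \log\!\left(\lambda_v e^{-\lambda_v h}\right)
  = \lambda_v \sum_{h \in H_v} h \;-\; |H_v| \log \lambda_v
```

The first expression is taken with the expectation under the first argument, which is one of the two possible readings of the slide's KL notation.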
13
Optimizing the cost measure
- Hard to solve exactly (knapsack)
- $(1+\epsilon)$ dynamic programming solution
  - Too slow for 10 million DOM nodes
- Greedy expansion approach: at each node v, compare the cost of
  - Directly encoding $H_v$ w.r.t. the model $\Theta_v$ at v
  - First distorting $\Theta_v$ to $\Theta_w$ for each child w of v, then encoding all $H_w$ w.r.t. the respective $\Theta_w$
- If the latter is smaller, expand v, else prune
- Aggregate hub scores at frontier nodes
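The greedy comparison above can be sketched as follows. This is a toy illustration under several assumptions that are not in the slides: hub scores are modeled as exponential with a rate fitted as 1/mean, the distortion cost is the closed-form KL divergence between two exponential rates, the data cost is a negative log-likelihood, and a node whose children are all leaves is never split further. The names `Node`, `fit_rate`, `kl_exp`, `encoding_cost` and `frontier` are hypothetical.

```python
import math

EPS = 1e-6  # assumed smoothing so an all-zero subtree still gets a finite rate

class Node:
    def __init__(self, name, score=None, children=()):
        self.name = name
        self.score = score              # hub score, set on leaves (links) only
        self.children = list(children)

def leaf_scores(v):
    """H_v: hub scores of all leaves in the subtree rooted at v."""
    if not v.children:
        return [v.score]
    return [s for w in v.children for s in leaf_scores(w)]

def fit_rate(scores):
    """Assumed local model: exponential rate estimated as 1 / mean score."""
    return len(scores) / (sum(scores) + EPS)

def kl_exp(rate_p, rate_q):
    """KL divergence between exponentials, expectation under rate_p."""
    return math.log(rate_p / rate_q) + rate_q / rate_p - 1.0

def encoding_cost(scores, rate):
    """Assumed data cost: negative log-likelihood under Exp(rate)."""
    return sum(rate * s - math.log(rate) for s in scores)

def frontier(v, rate_v):
    """Greedy expansion: return the frontier nodes below v."""
    # Assumption: leaves, and nodes whose children are all leaves, are not split
    if all(not w.children for w in v.children):
        return [v]
    prune_cost = encoding_cost(leaf_scores(v), rate_v)
    expand_cost, child_rates = 0.0, []
    for w in v.children:
        h_w = leaf_scores(w)
        rate_w = fit_rate(h_w)
        child_rates.append(rate_w)
        # Distort the model at v into the model at w, then encode H_w with it
        expand_cost += kl_exp(rate_v, rate_w) + encoding_cost(h_w, rate_w)
    if expand_cost < prune_cost:
        return [u for w, r in zip(v.children, child_rates) for u in frontier(w, r)]
    return [v]

# Toy mixed hub: one 'hot' section with high scores, one 'cold' section
hot = Node("hot", children=[Node(f"hot{i}", s) for i, s in enumerate([0.9, 1.1, 1.0, 0.8, 1.2])])
cold = Node("cold", children=[Node(f"cold{i}", s) for i, s in enumerate([0.05, 0.15, 0.1, 0.12, 0.08])])
page = Node("page", children=[hot, cold])
result = frontier(page, fit_rate(leaf_scores(page)))
print([n.name for n in result])
```

On this toy page, encoding the two sections separately is cheaper than encoding the whole page with one model, so the page is expanded and the two sections become the frontier. Real hub scores and the cost terms actually used in the talk may behave differently.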
14
Modified topic distillation algorithm
- Will this (non-linear) system converge?
- Will segmentation help in reducing drift?
- Can the system extract relevant micro-hubs?
Algorithm:
- Initialize DOM graph
- Let only root set authority scores be 1
- Repeat until reasonable convergence:
  - Authority-to-hub score propagation
  - MDL-based hub segmentation
  - Hub score aggregation at frontier nodes
  - Hub-to-authority score propagation
  - Normalization of authority scores
- Segment and rank micro-hubs
- Present annotated results
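The loop on this slide can be sketched in a simplified form. This is not the talk's implementation: the MDL-based segmentation is replaced by a pluggable stand-in function, pages are flattened to ordered lists of link targets instead of DOM trees, and the names `distill`, `whole_page` and `split_halves` are hypothetical. With `whole_page` the loop reduces to page-level hub scoring; with a segmentation that isolates the relevant links, the unrelated links stop receiving authority.

```python
import math

def whole_page(leaf_scores):
    """Stand-in segmentation: the entire page is a single micro-hub."""
    return [list(range(len(leaf_scores)))]

def split_halves(leaf_scores):
    """Stand-in segmentation: first and second half of the links (toy only)."""
    mid = len(leaf_scores) // 2
    return [list(range(mid)), list(range(mid, len(leaf_scores)))]

def distill(pages, root_set, segment=whole_page, iterations=15):
    """pages maps a hub URL to its link targets in document order."""
    targets = {t for links in pages.values() for t in links}
    # Let only root-set authority scores be 1
    auth = {t: (1.0 if t in root_set else 0.0) for t in targets}
    for _ in range(iterations):
        new_auth = {t: 0.0 for t in targets}
        for links in pages.values():
            # Authority-to-hub: each link (DOM leaf) takes its target's authority
            leaf = [auth[t] for t in links]
            # Hub segmentation (stand-in for the MDL-based step)
            for group in segment(leaf):
                # Hub score aggregation at the frontier node
                hub = sum(leaf[i] for i in group)
                # Hub-to-authority: every link in the segment confers that score
                for i in group:
                    new_auth[links[i]] += hub
        # Normalization of authority scores
        norm = math.sqrt(sum(a * a for a in new_auth.values())) or 1.0
        auth = {t: a / norm for t, a in new_auth.items()}
    return auth

# Toy mixed hub: two on-topic links followed by two off-topic links
pages = {"mixed": ["s1", "s2", "j1", "j2"], "pure": ["s1", "s2"]}
drifting = distill(pages, root_set={"s1", "s2"})
segmented = distill(pages, root_set={"s1", "s2"}, segment=split_halves)
print(drifting["j1"], segmented["j1"])
```

In the toy data the off-topic targets j1 and j2 pick up authority when the mixed page is scored as a whole, and receive none once the page is split into two micro-hubs.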
15
Convergence
- 28 queries used in Clever and by Bharat and Henzinger (B&H)
- 366k macro-pages, 10M micro-links
- Ranks converge within 15 iterations
16
Hub segmentation dynamics
- ‘Pruned’ = whole hub relevant, ‘expanded’ = subtree preferred
- In successive iterations
  - #pruned increases
  - #expanded decreases
- Residual expansions reduce authority leaks via mixed hubs
(Figure: x-axis: #iterations; y-axis: #pruned, #expanded)
17
Avoiding topic drift
(Figure panels: easy query: cycling; drift-prone query: affirmative action)
18
Rank correlation with B&H
- Positively correlated
- Some negative deviations
  - Pseudo-authorities downgraded by our algorithm
  - These were earlier favored by mixed hubs
(Figure: axes not to same scale)
19
“Recall-precision” measurements
- From root-set docs get term vector
- From k top authorities get term vector
- Find cosine with ‘TFIDF’ weights
- Plot cosine vs. k
- DOM-HITS shows higher average similarity (less topic drift)
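The measurement described above can be sketched as follows. The slide does not say which TFIDF variant or which corpus was used for document frequencies, so the smoothed IDF `log(1 + N/df)` computed over the root-set and authority documents together is an assumption, as are the function names and the toy documents.

```python
import math
from collections import Counter

def term_vector(docs, idf):
    """Sum term frequencies over docs and weight each term by its IDF."""
    tf = Counter(term for doc in docs for term in doc)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_curve(root_docs, ranked_auth_docs, ks):
    """Cosine between the root-set vector and the top-k authority vector, per k."""
    corpus = root_docs + ranked_auth_docs
    df = Counter(term for doc in corpus for term in set(doc))
    # Assumed smoothed IDF so terms present in every document keep some weight
    idf = {term: math.log(1 + len(corpus) / d) for term, d in df.items()}
    root_vec = term_vector(root_docs, idf)
    return [cosine(root_vec, term_vector(ranked_auth_docs[:k], idf)) for k in ks]

# Toy data: tokenized root-set pages and authorities in rank order
root_docs = [["cycling", "bike", "race"], ["bike", "tour"]]
ranked_auths = [["bike", "race", "cycling"], ["bike", "shop"], ["flowers", "gifts"]]
curve = similarity_curve(root_docs, ranked_auths, ks=[1, 2, 3])
print(curve)
```

In the toy data the similarity falls as k grows because the lower-ranked authorities are off-topic, which is the kind of drift the plot is meant to expose.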
20
Anecdotes
- “amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
- New algorithm reduces drift
- Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
- Mixed hubs in top 50 for 13/28 queries
21
Conclusion
- Hypertext shows complex idioms, missed by the coarse-grained graph model
- Enhanced fine-grained distillation
  - Identifies content-bearing ‘hot’ micro-hubs
  - Disaggregates hub scores
  - Reduces topic drift caused by mixed hubs and pseudo-communities
- Ongoing and future work
  - Model text, tag-tree and links (SIGIR 2001)
  - Scale up micro-level distillation