Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction
Soumen Chakrabarti
Center for Intelligent Internet Research, Computer Science and Engineering, Indian Institute of Technology Bombay
www.cse.iitb.ac.in/~soumen

WWW2001 Topic distillation
- Crisis of abundance
  - Short queries, many relevant pages
  - Popularity must complement relevance
- Link-based quality ranking
  - PageRank/Google
  - HITS, Clever, topic distillation
- Limitation
  - Page = node, link = edge, node has scores
  - Classical notion of document boundary too simple to capture complex Web idioms

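Hyperlink Induced Topic Search
- Query goes to a keyword search engine, which returns the root-set; the root-set is grown into the expanded graph
- ‘Hubs’ and ‘authorities’: $a = E^T h$, $h = E a$
[Figure: expanded graph with nodes labeled h (hub) and a (authority)]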
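
The two update rules on this slide can be tried out on a toy graph. The following is a minimal sketch of plain HITS, not the DOM-aware variant introduced later; the function name `hits`, the edge-list input, the fixed iteration count, and the L2 normalization are all assumptions made for illustration.

```python
import math

def hits(edges, iterations=50):
    """Plain HITS on a directed graph given as (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a = E^T h: authority of a page is the sum of hub scores pointing at it.
        auth = {n: 0.0 for n in nodes}
        for source, target in edges:
            auth[target] += hub[source]
        # h = E a: hub score of a page is the sum of authority scores it points to.
        hub = {n: 0.0 for n in nodes}
        for source, target in edges:
            hub[source] += auth[target]
        # Rescale both vectors to unit L2 norm so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

Calling `hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])` ranks `a1` above `a2` as an authority and `h1` above `h2` as a hub, because `a1` is cited by both hubs and `h1` cites both authorities.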

Tightly-knit pseudo-communities
- Links relevant to “amusement parks” on www.411fun.com
- Multi-site nepotistic links generated from template form pseudo-community
- Low-popularity content on “affirmative action”

Mixed hubs with ‘hot’ and ‘cold’ links
- This section specializes in ‘Shakespeare’
- Remaining sections generalize and/or drift

Challenge
- Evolution of Web content since 1996
  - Large, complex, dynamic pages generated as views from databases
  - Rampant multi-host ‘nepotism’: web-rings and banner exchanges
- Linear model for conferral of authority less valid
- File or page boundary less meaningful
- Deteriorating results of topic distillation

Document object model (DOM)
- Hierarchical graph model for semi-structured data
- Tag-tree for HTML
- Can extract reasonable DOM from HTML
- Fine-grained view of the Web: trees with ‘macro’-links from leaves to roots
[Figure: tag-tree of a ‘Portals’ page listing Yahoo and Lycos; levels: html; head, body; title, ul; li; a, a]
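
To make the tag-tree idea concrete, here is a small sketch that builds a tree from HTML with Python's standard `html.parser` module and lists the anchors that would act as ‘macro’-links. The class and function names (`Node`, `TagTreeBuilder`, `build_tag_tree`, `macro_links`) and the set of void tags are assumptions for illustration; the slides do not say which parser or clean-up rules were used.

```python
from html.parser import HTMLParser

# Elements that never have children; a partial list chosen for this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.href = None  # set only on <a> elements that carry an href

class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag == "a":
            node.href = dict(attrs).get("href")
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def macro_links(node):
    """Yield the href of every <a> element in document order."""
    if node.tag == "a" and node.href:
        yield node.href
    for child in node.children:
        yield from macro_links(child)
```

Each yielded href is a leaf of this page's tree pointing at the root of another page's tree, which is the fine-grained graph the slide describes.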

Why not run HITS on DOM graph?
- Bipartite cores central to the success of HITS
- Authority diffusion easily blocked by editorial idiosyncrasy
- Need a better model
[Figure: DOM tree with numbered nodes]

Other unsuccessful ideas
- Resistive networks
- Flow-based formulations

A model for hub (score) generation
- Global hub score distribution $\Theta_0$ w.r.t. given query
- Authors use DOM nodes to specialize $\Theta_0$ into local $\Theta_I$
- At a certain frontier in the DOM tree, local distribution directly generates hub scores in ‘hot’ and ‘cold’ subtrees
[Figure: global distribution, progressive ‘distortion’ down the tree, model frontier, other pages]

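A balanced cost measure
- Reference distribution $\Theta_0$
- Cumulative distortion cost = $\mathrm{KL}(\Theta_0; \Theta_u) + \dots + \mathrm{KL}(\Theta_u; \Theta_v)$
- Data encoding cost is roughly [formula not preserved] (for exponential distribution)
- Goal: Find minimum cost frontier
[Figure: node $u$ above frontier node $v$, with hub scores $H_v$ under $v$]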
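
The two cost terms can be written down for the exponential case. This is a sketch under stated assumptions, not the formula from the slide, whose data encoding expression did not survive: it takes each $\Theta$ to be an exponential distribution described by its rate, reads $\mathrm{KL}(a; b)$ as $\mathrm{KL}(a \,\|\, b)$, and uses the negative log-likelihood of the hub scores as the encoding cost. All function names are hypothetical.

```python
import math

def kl_exponential(rate_p, rate_q):
    """KL(p || q) in nats between Exponential(rate_p) and Exponential(rate_q)."""
    return math.log(rate_p / rate_q) + rate_q / rate_p - 1.0

def encoding_cost(hub_scores, rate):
    """Negative log-likelihood (nats) of the scores under Exponential(rate)."""
    return sum(rate * h - math.log(rate) for h in hub_scores)

def fit_rate(hub_scores):
    """Maximum-likelihood rate: the reciprocal of the mean (scores must be positive)."""
    return len(hub_scores) / sum(hub_scores)

def total_cost(path_rates, hub_scores):
    """Distortion along a root-to-node path plus encoding at the last node.

    path_rates lists the rates of the models on the path, starting with the
    reference model and ending with the model at the frontier node.
    """
    distortion = sum(kl_exponential(p, q)
                     for p, q in zip(path_rates, path_rates[1:]))
    return distortion + encoding_cost(hub_scores, path_rates[-1])
```

Because the encoding term uses a continuous density it can be negative, so the values are only meaningful when comparing candidate frontiers against each other.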

Optimizing the cost measure
- Hard to solve exactly (knapsack)
- $(1+\epsilon)$ dynamic programming solution
- Too slow for 10 million DOM nodes
- Greedy expansion approach: at each node $v$, compare the cost of
  - Directly encoding $H_v$ w.r.t. model at $v$
  - First distorting $\Theta_v$ to $\Theta_w$ for each child $w$ of $v$, then encoding all $H_w$ w.r.t. respective $\Theta_w$
- If latter is smaller expand $v$, else prune
- Aggregate hub scores at frontier nodes
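
The greedy expand-or-prune test can be sketched as a short recursion. The code below is an illustration under several assumptions that are not in the slides: models are exponential distributions fitted by maximum likelihood, leaves of the tree are blocks of links (each with at least one link) rather than single links, internal nodes carry no links of their own, and zero scores are floored at a small `EPS`. `DomNode`, `frontier`, and the helper names are hypothetical.

```python
import math

EPS = 1e-6  # floor that keeps zero hub scores from breaking the rate estimate

class DomNode:
    def __init__(self, name, children=(), link_scores=()):
        self.name = name
        self.children = list(children)
        # Hub scores of the links in this block; used on leaves only.
        self.link_scores = [max(s, EPS) for s in link_scores]

def subtree_scores(node):
    if not node.children:
        return node.link_scores
    return [s for child in node.children for s in subtree_scores(child)]

def fit_rate(scores):
    return len(scores) / sum(scores)

def kl(rate_p, rate_q):
    return math.log(rate_p / rate_q) + rate_q / rate_p - 1.0

def encode(scores, rate):
    return sum(rate * s - math.log(rate) for s in scores)

def frontier(node, rate):
    """Return the frontier nodes below `node`, whose model is Exponential(rate)."""
    if not node.children:
        return [node]
    # Option 1: encode every score in the subtree with the model at this node.
    prune_cost = encode(subtree_scores(node), rate)
    # Option 2: distort the model once per child, then encode each child's scores.
    expand_cost = 0.0
    child_rates = []
    for child in node.children:
        scores = subtree_scores(child)
        child_rate = fit_rate(scores)
        child_rates.append(child_rate)
        expand_cost += kl(rate, child_rate) + encode(scores, child_rate)
    if expand_cost < prune_cost:
        result = []
        for child, child_rate in zip(node.children, child_rates):
            result.extend(frontier(child, child_rate))
        return result
    return [node]
```

With a page whose two sections hold thirty links each, one scored 1.0 and the other 0.01, the page is expanded and both sections end up on the frontier; when both sections score 1.0 the page is pruned and stays whole. In this sketch the caller supplies the root rate, for example `fit_rate(subtree_scores(page))`, whereas the slides describe a global reference distribution shared across pages.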

Modified topic distillation algorithm
Initialize DOM graph
Let only root set authority scores be 1
Repeat until reasonable convergence:
    Authority-to-hub score propagation
    MDL-based hub segmentation
    Hub score aggregation at frontier nodes
    Hub-to-authority score propagation
    Normalization of authority scores
Segment and rank micro-hubs
Present annotated results

- Will this (non-linear) system converge?
- Will segmentation help in reducing drift?
- Can the system extract relevant micro-hubs?
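
The loop on this slide can be sketched with the segmentation step left as a pluggable function. This is an illustration only: `find_frontier` is a hypothetical hook standing in for the MDL-based segmentation, the two example hooks use fixed rules rather than MDL, the loop runs for a fixed number of iterations instead of testing convergence, and authority scores are normalized to sum to 1.

```python
class Node:
    def __init__(self, children=(), target=None):
        self.children = list(children)
        self.target = target  # URL of the linked page; set on link leaves only
        self.hub = 0.0

def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]

def distill(pages, root_set, find_frontier, iterations=20):
    """Return authority scores for every link target found in `pages`."""
    # Let only root set authority scores be 1.
    auth = {}
    for page in pages:
        for leaf in leaves(page):
            auth[leaf.target] = 1.0 if leaf.target in root_set else 0.0
    for _ in range(iterations):
        new_auth = dict.fromkeys(auth, 0.0)
        for page in pages:
            # Authority-to-hub propagation: each link picks up its target's score.
            for leaf in leaves(page):
                leaf.hub = auth[leaf.target]
            # Segmentation, then aggregation of hub scores at each frontier node.
            for segment in find_frontier(page):
                links = leaves(segment)
                segment_score = sum(link.hub for link in links)
                # Hub-to-authority propagation from the aggregated score.
                for link in links:
                    new_auth[link.target] += segment_score
        # Normalization of authority scores.
        norm = sum(new_auth.values()) or 1.0
        auth = {target: score / norm for target, score in new_auth.items()}
    return auth

def whole_page(page):
    """Frontier at the page root: behaves like page-level HITS."""
    return [page]

def top_level_sections(page):
    """Frontier at the page's sections, when every child is itself a subtree."""
    if page.children and all(child.children for child in page.children):
        return page.children
    return [page]
```

On a mixed hub with one section linking to root-set pages and another linking elsewhere, `whole_page` lets the off-topic targets collect authority, while `top_level_sections` keeps their score at zero because the cold section never aggregates any hub score.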

Convergence
- 28 queries used in Clever and by B&H (Bharat and Henzinger)
- 366k macro-pages, 10M micro-links
- Ranks converge within 15 iterations

Hub segmentation dynamics
- ‘Pruned’ = whole hub relevant, ‘expanded’ = subtree preferred
- In successive iterations
  - #pruned increases
  - #expanded decreases
- Residual expansions reduce authority leaks via mixed hubs
[Figure: #pruned and #expanded (y-axis) against #iterations (x-axis)]

Avoiding topic drift
- Easy query: cycling
- Drift-prone query: affirmative action

Rank correlation with B&H
- Positively correlated
- Some negative deviations
- Pseudo-authorities downgraded by our algorithm
- These were earlier favored by mixed hubs
(Axes not to same scale)

“Recall-precision” measurements
- From rootset docs get term vector
- From k top auths get term vector
- Find cosine with ‘TFIDF’ weights
- Plot cosine vs. k
- DOM-HITS shows higher average similarity (less topic drift)
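
The measurement described here amounts to comparing two TFIDF term vectors at increasing cut-offs. The sketch below shows one way to compute such a curve; the tokenized-document input, the particular IDF formula `log(1 + N/df)`, and every function name are assumptions, since the slide does not give the exact weighting.

```python
import math
from collections import Counter

def idf_table(docs):
    """Inverse document frequency per term over a list of token lists."""
    n = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    return {term: math.log(1 + n / count) for term, count in df.items()}

def tfidf_vector(docs, idf):
    """Pool term counts over the given docs and weight them by IDF."""
    counts = Counter()
    for tokens in docs:
        counts.update(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def drift_curve(rootset_docs, ranked_auth_docs, ks):
    """Cosine between the rootset vector and the top-k authority vector, per k."""
    idf = idf_table(rootset_docs + ranked_auth_docs)
    root_vector = tfidf_vector(rootset_docs, idf)
    return [(k, cosine(root_vector, tfidf_vector(ranked_auth_docs[:k], idf)))
            for k in ks]
```

A curve that stays high as k grows indicates that the top authorities keep using the rootset's vocabulary, which is the sense in which higher similarity is read as less topic drift.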

Anecdotes
- “amusement parks”: http://www.411fun.com/THEMEPARKS leaks authority via nepotistic links to www.411florists.com, www.411fashion.com, www.411eshopping.com, etc.
- New algorithm reduces drift
- Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
- Mixed hubs in top 50 for 13/28 queries

Conclusion
- Hypertext shows complex idioms, missed by coarse-grained graph model
- Enhanced fine-grained distillation
  - Identifies content-bearing ‘hot’ micro-hubs
  - Disaggregates hub scores
  - Reduces topic drift caused by mixed hubs and pseudo-communities
- Ongoing and future work
  - Model text, tag-tree and links (SIGIR 2001)
  - Scale up micro-level distillation
