Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction Soumen Chakrabarti Center for Intelligent.

Presentation transcript:

Integrating the Document Object Model with Hyperlinks for Enhanced Topic Distillation and Information Extraction
Soumen Chakrabarti
Center for Intelligent Internet Research
Computer Science and Engineering
Indian Institute of Technology Bombay

WWW2001
Topic distillation
• Crisis of abundance
  - Short queries, many relevant pages
  - Popularity must complement relevance
• Link-based quality ranking
  - PageRank/Google
  - HITS, Clever, topic distillation
• Limitation
  - Page = node, link = edge, node has scores
  - Classical notion of document boundary too simple to capture complex Web idioms

Hyperlink Induced Topic Search
• ‘Hubs’ and ‘authorities’: a = E^T h, h = E a
(Figure: query → keyword search engine → root set → expanded graph, with nodes labelled h and a)
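The update rule on this slide (a = E^T h, h = E a) can be sketched as a short power iteration. This is a minimal illustration, not the author's implementation: the function name `hits`, the edge-list input format, the fixed iteration count and the L2 normalization are all assumptions made for the example.

```python
def hits(edges, iterations=50):
    """Iterate a = E^T h, h = E a over a list of (source, target) links.

    Assumptions for this sketch: scores start at 1.0 and are L2-normalized
    after every round so that they stay bounded.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a = E^T h: a page's authority is the sum of hub scores pointing to it.
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # h = E a: a page's hub score is the sum of authority scores it points to.
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

Called on a tiny graph such as `hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])`, the page with more in-links from good hubs (`a1`) ends up with the higher authority score.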

Tightly-knit pseudo-communities
• Links relevant to “amusement parks” on [site name missing]
• Multi-site nepotistic links generated from template form pseudo-community
• Low-popularity content on “affirmative action”

Mixed hubs with ‘hot’ and ‘cold’ links
• This section specializes on ‘Shakespeare’
• Remaining sections generalize and/or drift

Challenge
• Evolution of Web content since 1996
  - Large complex dynamic pages generated as views from databases
  - Rampant multi-host ‘nepotism’: web-rings and banner exchanges
• Linear model for conferral of authority less valid
• File or page boundary less meaningful
• Deteriorating results of topic distillation

Document object model (DOM)
• Hierarchical graph model for semi-structured data
• Tag-tree for HTML
• Can extract reasonable DOM from HTML
• Fine-grained view of the Web: trees with ‘macro’-links from leaves to roots
(Figure: tag-tree of a ‘Portals’ page: html → head → title; html → body → ul → li → a, with links to Yahoo and Lycos)
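To make the tag-tree idea concrete, the sketch below records the tag path from the root down to each link. It uses only Python's standard `html.parser`; the class name `LinkPathParser`, the helper `link_paths` and the example URLs are hypothetical, and real pages with unclosed tags would need a more forgiving parser than this.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class LinkPathParser(HTMLParser):
    """Collects (tag path, href) for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root to the current node
        self.links = []   # (path, href) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(("/".join(self.stack), href))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating skipped closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def link_paths(html):
    parser = LinkPathParser()
    parser.feed(html)
    return parser.links
```

For a page shaped like the ‘Portals’ figure, each link comes back with the path `html/body/ul/li/a`, which is the leaf from which a ‘macro’-link to another page's root would start.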

Why not run HITS on DOM graph?
• Bipartite cores central to the success of HITS
• Authority diffusion easily blocked by editorial idiosyncrasy
• Need a better model
(Figure: DOM tree with nodes labelled 1, 2, 3)

Other unsuccessful ideas
• Resistive networks
• Flow-based formulations

A model for hub (score) generation
• Global hub score distribution Θ_0 w.r.t. given query
• Authors use DOM nodes to specialize Θ_0 into local Θ_I
• At a certain frontier in the DOM tree, local distribution directly generates hub scores in ‘hot’ and ‘cold’ subtrees
(Figure labels: global distribution, progressive ‘distortion’, model frontier, other pages)

A balanced cost measure
• Reference distribution Θ_0
• Cumulative distortion cost = KL(Θ_0; Θ_u) + … + KL(Θ_u; Θ_v)
• Data encoding cost is roughly [formula missing from transcript] (for exponential distribution)
• Goal: Find minimum cost frontier
(Figure: DOM path from the root through u to v, with hub scores H_v below v)
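The slide's own formula for the data encoding cost did not survive in this transcript. As a hedged reference point only, these are the standard textbook expressions that result if each node model is taken to be an exponential density with rate \lambda. The rate parameterization, the use of natural logarithms and the omission of any discretization term are assumptions of this sketch, not statements about the original slide.

```latex
% Assumed model at node v: p_v(h) = \lambda_v e^{-\lambda_v h}, \quad h \ge 0.

% Distortion cost of specializing the model at u into the model at v:
\mathrm{KL}(\Theta_u ; \Theta_v)
  = \ln\frac{\lambda_u}{\lambda_v} + \frac{\lambda_v}{\lambda_u} - 1

% Cost of encoding the hub scores H_v with the model at v
% (negative log-likelihood):
-\sum_{h \in H_v} \ln\!\left(\lambda_v e^{-\lambda_v h}\right)
  = \lambda_v \sum_{h \in H_v} h \;-\; |H_v| \ln \lambda_v
```

Under these assumptions the total cost of a frontier is the sum of the distortion terms along each root-to-frontier path plus the encoding terms at the frontier nodes.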

Optimizing the cost measure
• Hard to solve exactly (knapsack)
• (1+ε) dynamic programming solution
• Too slow for 10 million DOM nodes
• Greedy expansion approach: at each node v, compare the cost of
  - Directly encoding H_v w.r.t. model at v
  - First distorting Θ_v to Θ_w for each child w of v, then encoding all H_w w.r.t. respective Θ_w
• If latter is smaller expand v, else prune
• Aggregate hub scores at frontier nodes
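The greedy expand-or-prune comparison described on this slide can be sketched as follows. Everything specific here is an assumption made for illustration: the `Node` class, fitting each node's model as an exponential with rate 1/mean of the scores below it, and reading KL(Θ_v; Θ_w) as the divergence from the parent model to the child model. It is not the author's code and ignores the engineering needed for millions of DOM nodes.

```python
import math

class Node:
    """A DOM node; leaves carry hub scores, inner nodes pool their children's."""
    def __init__(self, scores=(), children=()):
        self.children = list(children)
        if self.children:
            self.scores = [s for c in self.children for s in c.scores]
        else:
            self.scores = list(scores)

def rate(scores):
    # Assumed model: exponential density fitted by maximum likelihood.
    mean = sum(scores) / len(scores)
    return 1.0 / max(mean, 1e-9)

def encode_cost(scores, lam):
    # Negative log-likelihood of the scores under an exponential with rate lam.
    return sum(lam * s - math.log(lam) for s in scores)

def kl_exp(lam_u, lam_v):
    # KL divergence from Exp(lam_u) to Exp(lam_v).
    return math.log(lam_u / lam_v) + lam_v / lam_u - 1.0

def frontier(node):
    """Return the frontier nodes chosen greedily below (and including) node."""
    if not node.children:
        return [node]
    lam_v = rate(node.scores)
    # Option 1: prune, i.e. encode all of H_v directly with the model at v.
    prune_cost = encode_cost(node.scores, lam_v)
    # Option 2: expand, i.e. distort the model to each child, then encode H_w.
    expand_cost = 0.0
    for w in node.children:
        lam_w = rate(w.scores)
        expand_cost += kl_exp(lam_v, lam_w) + encode_cost(w.scores, lam_w)
    if expand_cost < prune_cost:
        return [f for w in node.children for f in frontier(w)]
    return [node]

def aggregate(frontier_nodes):
    # Hub score of a frontier node: sum of the scores beneath it.
    return [sum(f.scores) for f in frontier_nodes]
```

With a mixed hub whose two subtrees hold clearly different scores (say about 1.0 on one side and about 0.015 on the other, twenty links each), the expansion is cheaper and the frontier drops to the two subtrees. With only a single link per subtree, the distortion cost outweighs the saving and the hub is pruned whole.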

Modified topic distillation algorithm
• Will this (non-linear) system converge?
• Will segmentation help in reducing drift?
• Can the system extract relevant micro-hubs?
Algorithm:
  Initialize DOM graph
  Let only root set authority scores be 1
  Repeat until reasonable convergence:
    Authority-to-hub score propagation
    MDL-based hub segmentation
    Hub score aggregation at frontier nodes
    Hub-to-authority score propagation
    Normalization of authority scores
  Segment and rank micro-hubs
  Present annotated results

Convergence
• 28 queries used in Clever and by B&H
• 366k macro-pages, 10M micro-links
• Ranks converge within 15 iterations

Hub segmentation dynamics
• ‘Pruned’ = whole hub relevant, ‘expanded’ = subtree preferred
• In successive iterations, #pruned increases and #expanded decreases
• Residual expansions reduce authority leaks via mixed hubs
(Figure: x-axis: #iterations; y-axis: #pruned, #expanded)

Avoiding topic drift
(Figure panels: easy query: cycling; drift-prone query: affirmative action)

Rank correlation with B&H
• Positively correlated
• Some negative deviations
• Pseudo-authorities downgraded by our algorithm
• These were earlier favored by mixed hubs
(Figure: axes not to same scale)

“Recall-precision” measurements
• From rootset docs get term vector
• From k top auths get term vector
• Find cosine with ‘TFIDF’ weights
• Plot cosine vs. k
• DOM-HITS shows higher average similarity (less topic drift)
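The measurement above (cosine between the root-set term vector and the term vector of the top k authorities) can be sketched as below. The whitespace tokenizer, the smoothed IDF formula and all function names are assumptions chosen for a self-contained example; the slide only says ‘TFIDF’ weights without giving the exact variant.

```python
import math
from collections import Counter

def tokens(text):
    # Deliberately naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def idf_weights(corpus):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(tokens(doc)))
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

def tfidf_vector(docs, idf):
    """One pooled TF-IDF vector for a set of documents."""
    tf = Counter(t for doc in docs for t in tokens(doc))
    return {t: count * idf.get(t, 0.0) for t, count in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_curve(root_docs, ranked_auth_docs, idf):
    """Cosine between the root-set vector and the top-k authorities, k = 1..n."""
    root_vec = tfidf_vector(root_docs, idf)
    return [cosine(root_vec, tfidf_vector(ranked_auth_docs[:k], idf))
            for k in range(1, len(ranked_auth_docs) + 1)]
```

Plotting the list returned by `similarity_curve` against k gives the kind of curve the slide describes: a ranking that drifts off topic shows the cosine falling as more authorities are included.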

Anecdotes
• “amusement parks”: authority leaks via nepotistic links to [site names missing], etc.
• New algorithm reduces drift
• Mixed hubs accurately segmented, e.g. amusement parks, classical guitar, Shakespeare and sushi
• Mixed hubs in top 50 for 13/28 queries

Conclusion
• Hypertext shows complex idioms, missed by coarse-grained graph model
• Enhanced fine-grained distillation
  - Identifies content-bearing ‘hot’ micro-hubs
  - Disaggregates hub scores
  - Reduces topic drift caused by mixed hubs and pseudo-communities
• Ongoing and future work
  - Model text, tag-tree and links (SIGIR 2001)
  - Scale up micro-level distillation