Matching DOM Trees to Search Logs for Accurate Webpage Clustering. Deepayan Chakrabarti, Rupesh Mehta.

Presentation transcript:

1 Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta

2 Data extraction
Webpages from a site -> website-specific wrappers -> structured DB (product_name, price, rating)

3 Data Extraction: building wrappers [Muslea+/98, Crescenzi+/01, Cohen+/02, Hogue+/05, Irmak+/06]
- Cluster pages from the website based on similarity of DOM structure
- Pick a few example pages per cluster
- Manually annotate the DOM nodes that contain the data
- Automatically induce a wrapper from these annotations
(Figure labels: Wrapper 1, Wrapper 2)

4 Data Extraction: clustering affects quality
- Too few clusters: the clusters are heterogeneous, which leads to imperfect wrappers or even the inability to build wrappers
- Too many clusters: significant editorial effort is required to build wrappers
We want to automatically get a good clustering for any website.

5 Main Idea
- "Useful" info on a page: wrappers extract it, and users search for it
- After a search and a click, the search terms match the page content
- DOM paths repeatedly referenced by search terms are "key" paths
- In the example tree (html, h1, b), "html h1" and "html h1 b" are key paths
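
As a rough illustration of what "DOM paths referenced by search terms" could mean in practice, the sketch below collects the text under each root-to-node tag path and lists the paths whose text contains a query term. The names `PathTextCollector`, `dom_paths`, and `paths_matching` are hypothetical, and two choices are assumptions rather than details from the slides: text is credited to a node and to all of its ancestors, and matching is a plain lowercase token overlap.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag; they are skipped so the stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextCollector(HTMLParser):
    """Collects the text found under each root-to-node tag path (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []                     # currently open tags, root first
        self.path_text = defaultdict(list)  # "html h1 b" -> list of text pieces

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            # Assumption: text counts for its own node and for every ancestor.
            for depth in range(1, len(self.stack) + 1):
                self.path_text[" ".join(self.stack[:depth])].append(text)


def dom_paths(html):
    """Map each tag path (e.g. "html h1 b") to the text found under it."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return {path: " ".join(pieces) for path, pieces in collector.path_text.items()}


def paths_matching(html, query):
    """Return the sorted paths whose text shares at least one token with the query."""
    terms = set(query.lower().split())
    return sorted(
        path for path, text in dom_paths(html).items()
        if terms & set(text.lower().split())
    )
```

Under these assumptions every ancestor up to the root also matches, so a real ranking step is still needed to single out the informative paths.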

6 Main Idea: clustering using key paths
- Pre-processing step (for each site): given a large sample of pages and search logs, identify key paths
- Run-time (for that website): given a new webpage, find which key paths exist on the page, then map the page to a cluster using its key paths

7 Mapping pages to clusters
- Pages in a cluster should have similar tree structure, and hence similar paths
- Represent a page by a shingle of its paths [Buttler/04]
- Using key paths: the shingle preferentially picks key paths in the page, which requires a global ranking of key paths

8 Mapping pages to clusters
- One cluster per shingle
- All pages in a cluster share the same k "key" paths
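
The "one cluster per shingle" rule can be sketched as a simple grouping step. This is a minimal illustration, not the authors' implementation: `cluster_by_shingle` is a hypothetical name, and treating a shingle as an unordered set of paths is an assumption.

```python
from collections import defaultdict


def cluster_by_shingle(page_shingles):
    """Group page ids whose shingles are identical.

    page_shingles: dict mapping a page id to an iterable of DOM paths (its shingle).
    Returns a dict mapping each distinct shingle (as a frozenset) to a sorted
    list of the page ids that carry it.
    """
    clusters = defaultdict(list)
    for page_id, shingle in page_shingles.items():
        # Assumption: the order of paths inside a shingle does not matter.
        clusters[frozenset(shingle)].append(page_id)
    return {shingle: sorted(ids) for shingle, ids in clusters.items()}
```

Pages that share exactly the same k paths end up under the same key, so the number of distinct shingles is the number of clusters.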

10 Identify key paths
- For every (query, webpage) pair, match the query terms to the text of each DOM path; this yields a precision and a recall for every path
- Aggregate over all queries and webpages to get the expected precision and recall of a path
- These are high if the path appears on many queried pages and has high precision/recall in most of them
(Figure labels: html, h1, b, title, price)
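
The slide does not spell out the exact definitions, so the sketch below fixes one plausible reading as an assumption: per pair, precision is the fraction of a path's tokens that are query terms, and recall is the fraction of query terms found in the path's text. The expected values are plain averages over all (query, page) pairs, with a pair contributing zero when the path is absent from the page. All function names are hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    return text.lower().split()


def path_precision_recall(query, path_text):
    """Precision and recall of one path's text against one query (assumed definitions)."""
    query_terms = set(tokenize(query))
    tokens = tokenize(path_text)
    if not query_terms or not tokens:
        return 0.0, 0.0
    precision = sum(1 for token in tokens if token in query_terms) / len(tokens)
    recall = len(query_terms & set(tokens)) / len(query_terms)
    return precision, recall


def expected_precision_recall(clicks):
    """Average precision and recall per path over all (query, page) pairs.

    clicks: list of (query, {path: text}) pairs, one per clicked page.
    A path missing from a page contributes zero to both sums for that pair,
    so paths that appear on many queried pages score higher.
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for query, paths in clicks:
        for path, text in paths.items():
            precision, recall = path_precision_recall(query, text)
            totals[path][0] += precision
            totals[path][1] += recall
    n = len(clicks)
    return {path: (p / n, r / n) for path, (p, r) in totals.items()}
```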

11 Identify key paths: how can we combine expected precision and recall into one ranking of key paths?
- F-measure is one option, but:
  - precision is typically more important than recall
  - precision and recall may be on completely different scales
  - this scaling factor varies among websites
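
For reference, the standard weighted F-measure is shown below; the symbols P (precision), R (recall), and β are the usual textbook notation, not taken from the slides. Choosing β < 1 weights precision more heavily, which addresses the first objection but not the scale problem.

```latex
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R}
```

As a hypothetical illustration of the scale problem: with β = 1, a path with P = 0.5 and R = 0.01 scores about 0.0196, and a path with P = 0.9 and R = 0.01 scores about 0.0198. When recall values are much smaller than precision values, the F-measure is dominated by recall, and large differences in precision barely change the ranking.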

12 Identify key paths: how can we combine expected precision and recall into one ranking of key paths?
- Borda method [Borda/1781]
  - Create two rankings of paths, one by precision and one by recall
  - Combine the rankings into one ranking, using the relative importance of precision to recall
  - Immune to varying scales of precision/recall values among websites
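
A weighted Borda count along these lines might look like the sketch below. The slides do not give the exact weighting, so the `precision_weight` parameter and its default of 0.7 are assumptions, as are the function names. Each path earns points equal to the number of paths it ties or beats in each ranking, and the two point totals are mixed linearly.

```python
def borda_points(scores):
    """Return {item: points}; the best-scoring item gets the most points."""
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    n = len(ranked)
    return {item: n - position for position, item in enumerate(ranked)}


def combine_borda(precision, recall, precision_weight=0.7):
    """Merge a precision ranking and a recall ranking into one ranking, best first.

    precision, recall: dicts over the same set of paths.
    precision_weight: assumed knob for the relative importance of precision.
    """
    p_points = borda_points(precision)
    r_points = borda_points(recall)
    combined = {
        path: precision_weight * p_points[path]
        + (1 - precision_weight) * r_points[path]
        for path in precision
    }
    return sorted(combined, key=lambda path: (-combined[path], path))
```

Because only rank positions enter the combination, rescaling all recall values by a constant leaves the result unchanged, which is the scale immunity the slide refers to.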

13 Main Idea: clustering using key paths
- Pre-processing step (for each site): given a large sample of pages and search logs, identify key paths; but key paths can be dependent
- Run-time (for that website): given a new webpage, find which key paths exist on the page, then map the page to a cluster using its key paths

14 Handling dependent paths
- Consider the following two paths:
  - html body div div table tr td h1 span ("product name")
  - html body div div table tr td h1
- If one is a key path, the other probably is too
- The shingle can get "swamped": the shingle of a page becomes (product_name, product_name_parent, product_name_ancestor) instead of (product_name, buy_button, rating)

15 Handling dependent paths: several sources of dependence
- Multiple paths may have similar content
  - the "product name" header and its parent
  - the product name mentioned in a header and in the text
- Multiple paths may always co-occur
  - the "product name" header and "price"

16 Handling dependent paths: identify key independent paths
- Build a graph of dependencies between paths
- Pick an independent set of paths, i.e., a set of paths in which no path is connected to another
- Computation is weighted strongly towards the top-ranked paths; under our weighting scheme, greedily picking an independent set is optimal
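
The greedy selection can be sketched as follows: walk the ranked list from the top, keep a path unless it is connected to one already kept, and block its neighbours. How the dependency edges are detected and what the weighting scheme is are not given in the slides, so the edge list is simply an input here and `greedy_independent_paths` is a hypothetical name.

```python
def greedy_independent_paths(ranked_paths, dependencies):
    """Greedily pick an independent set of paths, favouring top-ranked ones.

    ranked_paths: list of paths, best first.
    dependencies: iterable of (path_a, path_b) edges in the dependency graph.
    """
    neighbours = {}
    for a, b in dependencies:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    chosen = []
    blocked = set()
    for path in ranked_paths:
        if path in blocked:
            continue  # connected to a higher-ranked path that was already chosen
        chosen.append(path)
        blocked |= neighbours.get(path, set())
    return chosen
```

The optimality claim on the slide depends on the authors' weighting scheme; this sketch only shows the greedy procedure itself.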

17 Main Idea: clustering using key paths
- Pre-processing step (for each site): given a large sample of pages and search logs, identify key paths
- Run-time (for that website): given a new webpage, find which key paths exist on the page, then map the page to a cluster using its key paths
- Several other optimizations (in paper)

18 Experiments
- 10 major websites
  - Sampled ~20,000 pages each
  - Built ground truth: ran an existing clustering algorithm and manually checked the results
    - Homogeneous clusters: merge when necessary
    - Heterogeneous clusters: change parameters, repeat
  - Small sample of search logs: ~5K unique queries per site, far fewer than the number of pages per site

19 Experiments: compared to clustering using well-known tree-similarity metrics
- Path Shingles: shingle of DOM paths without using key paths [Buttler/04]
- pq-Grams: shingle of sub-trees of the DOM tree [Augsten+/05]
- m/k Path Shingles: like path shingles, except only m out of k shingle elements need to match
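
The m/k relaxation can be read as a threshold on the overlap between two shingles. The one-function sketch below takes that reading as an assumption; `shingles_match` is a hypothetical name.

```python
def shingles_match(shingle_a, shingle_b, m):
    """Return True if the two shingles share at least m elements."""
    return len(set(shingle_a) & set(shingle_b)) >= m
```

With shingles of 8 paths and m = 6, two pages that differ in up to two paths would still be treated as matching, whereas plain path shingles require all 8 to agree.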

20 Experiments
- Compared clusterings using the Adjusted Rand index (higher is better; 1.0 is perfect)
- Search logs give significant lift, with very low variance
(Figure legend: Our algorithm, [Buttler/04], [Augsten+/05])
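
For readers unfamiliar with the metric, the Adjusted Rand index compares two clusterings of the same items by counting item pairs that are grouped together in both, corrected for chance agreement. The sketch below is a standard textbook formulation, not code from the paper; returning 1.0 in the degenerate case where the denominator is zero is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree about

    # Pairs placed together in both clusterings, in the truth, and in the prediction.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings are a single cluster
    return (together_both - expected) / (maximum - expected)
```

Cluster label names do not matter, only which items are grouped together, so a clustering with permuted labels still scores 1.0 against the ground truth.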

21 Experiments: comparison against paths actually used by manually-designed wrappers
- Key paths correspond to paths used in wrappers
(Figure: precision of IndepPaths)

22 Experiments Examples of top-ranked paths

23 Conclusions
- Clusters affect both wrapper quality and the degree of editorial effort
- We use search logs to automatically find good clusters
- Current efforts: combining search features with content features to pick key paths

24 Mapping pages to clusters
- Given a ranked list of key paths and a shingle size k
- For any page P:
  - Find KP = all key paths in P
  - If |KP| < k: Shingle = KP plus randomly chosen paths from the page
  - Else: Shingle = top-ranked k paths in KP
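
The procedure above can be sketched as follows. One detail is an assumption: the "randomly chosen" filler paths are picked here by a stable hash order, so that the same page always yields the same shingle; the slides do not say how the random choice is made. `page_shingle` and `stable_hash` are hypothetical names.

```python
import hashlib


def stable_hash(path):
    """Deterministic pseudo-random ordering key for a path."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()


def page_shingle(page_paths, ranked_key_paths, k):
    """Build a shingle of up to k paths for a page, preferring key paths.

    page_paths: DOM paths present on the page.
    ranked_key_paths: global ranking of key paths for the site, best first.
    """
    page_set = set(page_paths)
    # KP: key paths that occur on this page, kept in global rank order.
    key_paths_in_page = [path for path in ranked_key_paths if path in page_set]
    if len(key_paths_in_page) >= k:
        return tuple(key_paths_in_page[:k])

    # Too few key paths: pad with other paths from the page (assumed hash order).
    others = sorted(page_set - set(key_paths_in_page), key=stable_hash)
    missing = k - len(key_paths_in_page)
    return tuple(key_paths_in_page + others[:missing])
```

If the page has fewer than k paths in total, the shingle is simply shorter than k.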

25 Experiments

26 Experiments

27 Experiments
- Compared clusterings using the Adjusted Rand index (higher is better; 1.0 is perfect)
- Search logs give significant lift, with very low variance
(Figure legend and annotations: Our algorithm; shingle of 8 paths, only 6 need to match; shingles w/o key paths [Buttler/04]; shingles of DOM subtrees [Augsten+/05])