1 Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta.

1 Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta

2 Data extraction Website- specific wrappers Webpages from a site Structured DB (product_name, price, rating)

3 Data Extraction Wrapper 1 Wrapper 2 Building wrappers [Muslea+/98, Crescenzi+/01, Cohen+/02, Hogue+/05, Irmak+/06] Cluster pages from the website based on similarity of DOM structure Pick a few example pages per cluster Manually annotate the DOM nodes which contain the data Automatic wrapper induction using these annotations

4 Data Extraction Clustering affects quality  Too few clusters: Heterogeneity of clusters Imperfect wrappers, or even inability to build wrappers  Too many clusters: Significant editorial effort required to build wrappers We want to automatically get a good clustering, for any website

5 Main Idea “Useful” info on a page Wrappers extract it Users search for it html h1 b search +click search terms match page content DOM paths repeatedly referenced by search terms are “key” paths “html h1” and “html h1 b” are key paths

6 Main Idea Clustering using key paths  Pre-processing step (for each site) Given a large sample of pages and search logs Identify key paths  Run-time (for that website) Given a new webpage Find which key paths exist on the page Map page to cluster using its key paths

7 Mapping pages to clusters Pages in a cluster should have similar tree structure  and hence, similar paths  Represent a page by a shingle of its paths [Buttler/04] Using key paths:  Shingle preferentially picks key paths in the page  Requires a global ranking of key paths

8 Mapping pages to clusters One cluster per shingle All pages in a cluster share the same k “key” paths

9 Main Idea Clustering using key paths  Pre-processing step (for each site) Given a large sample of pages and search logs Identify key paths  Run-time (for that website) Given a new webpage Find which key paths exist on the page Map page to cluster using its key paths

10 Identify key paths For every (query, webpage) pair  match query terms to text of a DOM path  yields precision and recall for every path Need to aggregate over all queries and webpages  Expected precision and recall of a path High if path appears on many queried pages, and has high precision/recall in most of them html h1 b title price

11 Identify key paths How can we combine expected precision and recall into one ranking of key paths?  F-measure, but Precision typically more important than recall Precision and recall may be in completely different scales This scaling factor varies among websites

12 Identify key paths How can we combine expected precision and recall into one ranking of key paths?  Borda method [Borda/1781] Create two rankings of paths, one by precision and one by recall Combine rankings into one ranking, using relative importance of precision to recall Immune to varying scales of precision/recall values among websites

13 Main Idea Clustering using key paths  Pre-processing step (for each site) Given a large sample of pages and search logs Identify key paths, but  Key paths can be dependent  Run-time (for that website) Given a new webpage Find which key paths exist on the page Map page to cluster using its key paths

14 Handling dependent paths Consider the following two paths:  html body div div table tr td h1 span (“product name”)  html body div div table tr td h1  If one is a key path, probably the other is too Shingle can get “swamped”  Shingle of a page becomes: (product_name, product_name_parent, product_name_ancestor)  instead of: (product_name, buy_button, rating)

15 Handling dependent paths Several sources of dependence  Multiple paths may have similar content “product name” header and its parent product name mentioned in a header and in the text  Multiple paths may always co-occur “product name” header and “price”

16 Handling dependent paths Identify key independent paths  Build a graph of dependencies between paths  Pick an independent set of paths i.e., a set of paths where no one is connected to another  Computation is weighted strongly towards the top- ranked paths Under our weighting scheme, greedily picking an independent set is optimal

17 Main Idea Clustering using key paths  Pre-processing step (for each site) Given a large sample of pages and search logs Identify key paths  Run-time (for that website) Given a new webpage Find which key paths exist on the page Map page to cluster using its key paths  Several other optimizations (in paper)

18 Experiments 10 major websites  Sampled ~20,000 pages each  Built ground truth Ran an existing clustering algorithm Manually checked results  Homogeneous clusters: merge when necessary  Heterogeneous clusters: change parameters, repeat  Small sample of search logs ~5K unique queries per site Far fewer than the number of pages per site

19 Experiments Compared to clustering using well-known tree-similarity metrics  Path Shingles: Shingle of DOM paths without using key paths [Buttler/04]  pq-Grams: Shingle of sub-trees of DOM tree [Augsten+/05]  m/k Path Shingles: Like path shingles, except only m out of k shingle elements need to match

20 Experiments Compared clustering using Adjusted RAND index  higher is better, 1.0 is perfect Our algorithm [Buttler/04][Augsten+/05] Search logs give significant lift, with very low variance

21 Experiments Comparison against paths actually used by manually-designed wrappers Precision of IndepPaths Key Paths correspond to paths used in wrappers

22 Experiments Examples of top-ranked paths

23 Conclusions Clusters affect both  wrapper quality, and  degree of editorial effort We use search logs to automatically find good clusters Current efforts:  Combining search features with content features to pick key paths

24 Mapping pages to clusters Given an ranked list of key paths Given a shingle-size k For any page P  Find KP = all key paths in P  If |KP| < k Shingle = KP plus randomly chosen paths from page  else Shingle = top-ranked k paths in KP

25 Experiments

26 Experiments

27 Experiments Compared clustering using Adjusted RAND index  higher is better, 1.0 is perfect Our algorithm Shingle of 8 paths; only 6 need to match Shingles w/o key paths [Buttler/04][Augsten+/05] Shingles of DOM subtrees Search logs give significant lift, with very low variance

1 Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta.

Similar presentations

Presentation on theme: "1 Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta.

Similar presentations

Presentation on theme: "1 Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta."— Presentation transcript:

Similar presentations

About project

Feedback