A Graph-Theoretic Approach to Webpage Segmentation. Deepayan Chakrabarti, Ravi Kumar, Kunal Punera

1 A Graph-Theoretic Approach to Webpage Segmentation. Deepayan Chakrabarti, Ravi Kumar, Kunal Punera

2 Motivation and Related Work
[Figure: an example webpage with its regions labeled Header, Navigation bar, Primary content, Related links, Copyright, Ad]

3 Motivation and Related Work
Goal: divide a webpage into visually and semantically cohesive sections.
[Figure: the same webpage, with regions labeled Header, Navigation bar, Primary content, Related links, Copyright, Ad]

4 Motivation and Related Work
Sectioning can be useful in:
- Webpage classification
- Displaying webpages on mobile phones and small-screen devices
- Webpage ranking
- Duplicate detection
- …

5 Motivation and Related Work
A lot of recent interest:
- Informative structure mining [Cai+/2003, Kao+/2005]
- Displaying webpages on small screens [Chen+/2005, Baluja/2006]
- Template detection [Bar-Yossef+/2002]
- Topic distillation [Chakrabarti+/2001]
These approaches are based solely on visual, content, or DOM-based cues, and are mostly heuristic.

6 Motivation and Related Work
Our contributions:
- Combine visual, DOM, and content-based cues
- Propose a formal graph-based combinatorial optimization approach
- Develop two instantiations, both with approximation guarantees and automatic determination of the number of sections
- Develop methods for automatically learning the graph weights

7 Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

8 Proposed Work
A graph-based approach:
- Construct a neighborhood graph over DOM tree nodes
- Two nodes are neighbors if they are close according to DOM tree distance, or visual distance when rendered on the screen, or similarity of content types
- Partition the neighborhood graph to optimize a cost function
[Figure: an example DOM tree over nodes A-E and the corresponding neighborhood graph]
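As an illustration, here is a minimal Python sketch of neighborhood-graph construction using only the DOM-tree-distance cue (the visual and content cues are omitted); the `max_dist` threshold and the helper names are our assumptions, not the paper's.

```python
def tree_distance(p, q, parent):
    """Number of edges between DOM nodes p and q in the tree
    (parent maps each node to its parent; the root maps to None)."""
    def ancestors(n):
        path = []
        while n is not None:
            path.append(n)
            n = parent[n]
        return path
    pa, qa = ancestors(p), ancestors(q)
    common = set(pa) & set(qa)
    # distance = steps from each node up to their lowest common ancestor
    steps_p = next(i for i, n in enumerate(pa) if n in common)
    steps_q = next(i for i, n in enumerate(qa) if n in common)
    return steps_p + steps_q

def neighborhood_graph(nodes, parent, max_dist=2):
    """Edges between DOM nodes whose tree distance is at most max_dist."""
    return {(p, q) for i, p in enumerate(nodes) for q in nodes[i + 1:]
            if tree_distance(p, q, parent) <= max_dist}
```

In the full method, an edge would be added when any of the three cues (tree, visual, content) says the nodes are close.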

9 Proposed Work
A graph-based approach:
- What is a good cost function? It should be intuitive, and have polynomial-time algorithms that get provably close to the optimum. Two candidates: Correlation Clustering and Energy-minimizing Graph Cuts.
- How should we set the weights in the neighborhood graph?
[Figure: the example DOM tree and its neighborhood graph]

10 Correlation Clustering
Assign each DOM node p to a section S(p). The cost function pays a penalty V_pq for placing DOM nodes p and q in different sections, where the V_pq are edge weights in the neighborhood graph.
[Figure: the neighborhood graph with edge weights V_AB, V_AE, V_BC]

11 Correlation Clustering
Rendering Constraint:
- Each pixel on the screen must belong to at most one section
- Parent section = child section
- The constraint applies only to DOM nodes "aimed" at visual rendering
[Figure: a DOM subtree with parent A and children B, C; either S_A = S_B = S_C, or S_A ≠ S_B and S_A ≠ S_C]

12 Correlation Clustering
The Rendering Constraint is not enforced by CCLUS.
Workaround: use only leaf nodes in the neighborhood graph.
- But content cues may be too noisy at the leaf level
[Figure: a DOM subtree with parent A and children B, C; either S_A = S_B = S_C, or S_A ≠ S_B and S_A ≠ S_C]

13 Correlation Clustering
Algorithm [Ailon+/2005]:
- Pick a random leaf node p
- Create a new section containing p and all nodes q that are strongly connected to p
- Remove p and the chosen q's from the neighborhood graph
- Iterate until the graph is empty
The result is within a factor of 2 of the optimum, and the number of sections is picked automatically.
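The pivot procedure above can be sketched as follows. Treating edge weights as same-section probabilities and thresholding at 0.5 for "strongly connected" is our assumption for the demo, not a detail from the slides.

```python
import random

def cc_pivot(nodes, same_prob):
    """Pivot-style correlation clustering (after Ailon et al. 2005).

    nodes: list of node ids.
    same_prob: dict mapping frozenset({p, q}) -> weight in [0, 1],
        the probability that p and q belong to the same section.
    Returns a list of sections (each a list of node ids).
    """
    remaining = list(nodes)
    random.shuffle(remaining)              # random pivot order
    sections = []
    while remaining:
        pivot = remaining.pop(0)
        section, rest = [pivot], []
        for q in remaining:
            # attach q to the pivot's section if strongly connected to it
            if same_prob.get(frozenset((pivot, q)), 0.0) > 0.5:
                section.append(q)
            else:
                rest.append(q)
        remaining = rest
        sections.append(section)
    return sections
```

Note that the number of sections falls out of the procedure; it is never specified in advance.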

14 Proposed Work
A graph-based approach:
- What is a good cost function? It should be intuitive, and have polynomial-time algorithms that get provably close to the optimum. Two candidates: Correlation Clustering and Energy-minimizing Graph Cuts.
- How should we set the weights in the neighborhood graph?
[Figure: the example DOM tree and its neighborhood graph]

15 Energy-minimizing Graph Cuts
Extra ingredient: a predefined set of labels. Assign to each node p a label S(p). The cost has two parts: the distance D_p(S(p)) of a node to its label, and the distance V_pq(S(p), S(q)) between pairs of nodes.

16 Energy-minimizing Graph Cuts
Differences from CCLUS:
- Node weights D_p in addition to edge weights V_pq
- D_p and V_pq can depend on the labels themselves (not just on whether two labels are the "same" or "different")
[Figure: the neighborhood graph with edge weights V_AB, V_AE, V_BC and node weights D_A, D_B, D_E]
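Putting the two parts together, the energy being minimized is E(S) = sum_p D_p(S(p)) + sum_(p,q) V_pq(S(p), S(q)). A short helper makes the shape of the objective concrete; the dict-of-functions representation is purely illustrative.

```python
def energy(labels, D, V):
    """Evaluate E(S) = sum_p D_p(S(p)) + sum_{(p,q)} V_pq(S(p), S(q)).

    labels: dict node -> assigned label S(p)
    D: dict node -> function(label) -> cost (node-to-label distance)
    V: dict (p, q) -> function(label_p, label_q) -> cost (pairwise distance)
    """
    node_cost = sum(D[p](labels[p]) for p in labels)
    pair_cost = sum(V[(p, q)](labels[p], labels[q]) for (p, q) in V)
    return node_cost + pair_cost
```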

17 Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
- Introduce a special "invisible" label ξ
- A parent is invisible unless all its children have the same label
- The V_pq values can be set to enforce this
[Figure: parent A with children B and C; S_A = ?]

18 Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
- Introduce a special "invisible" label ξ
- A parent is invisible unless all its children have the same label
- The V_pq values can be set to enforce this
- This automatically infers "rendering" versus "structural" DOM nodes

19 Energy-minimizing Graph Cuts
Why couldn't we use this trick in CCLUS as well?
- CCLUS only asks: are nodes p and q in the same section or not?
- It cannot handle "special" sections like the invisible section
- Hence, labels give us extra power

20 Energy-minimizing Graph Cuts
Advantages:
- Can use all DOM nodes, while still obeying the Rendering Constraint: better than CCLUS
- Factor-of-2 approximation of the optimum, by performing iterative min-cuts on specially constructed graphs (we extend [Kolmogorov+/2004])
- The number of sections is picked automatically
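GCUTS performs iterative min-cuts on specially constructed graphs. The binary-label core of such constructions is the textbook two-label reduction to an s-t min cut, sketched below with a plain Edmonds-Karp max-flow; this is not the paper's full multi-label algorithm, only the building block.

```python
from collections import deque

def min_cut_binary_labels(nodes, D, V):
    """Minimize E = sum_p D[p][x_p] + sum_{(p,q)} V[(p,q)] * [x_p != x_q]
    over binary labels x_p in {0, 1}, via a single s-t min cut.

    D: dict node -> (cost_if_label_0, cost_if_label_1)
    V: dict (p, q) -> nonnegative separation penalty
    """
    cap, adj = {}, {}
    def edge(u, v, c):
        cap[(u, v)] = cap.get((u, v), 0) + c
        cap.setdefault((v, u), 0)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    for p in nodes:
        c0, c1 = D[p]
        edge("s", p, c1)   # cutting s->p  <->  x_p = 1 (pay c1)
        edge(p, "t", c0)   # cutting p->t  <->  x_p = 0 (pay c0)
    for (p, q), w in V.items():
        edge(p, q, w)
        edge(q, p, w)
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent, queue = {"s": None}, deque(["s"])
        while queue and "t" not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if "t" not in parent:
            break
        path, v = [], "t"
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[e] for e in path)
        for (u, v) in path:
            cap[(u, v)] -= push
            cap[(v, u)] += push
        flow += push
    # nodes still reachable from s in the residual graph get label 0
    reach, queue = {"s"}, deque(["s"])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in reach and cap[(u, v)] > 0:
                reach.add(v)
                queue.append(v)
    return flow, {p: 0 if p in reach else 1 for p in nodes}
```

The returned cut value equals the minimum energy, and the side of the cut each node lands on gives its label.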

21 Energy-minimizing Graph Cuts
Theorem: the V_pq must obey the constraint
- Separation cost ≥ Merge cost
- Set V_pq(different) >> V_pq(same) for nodes that are extremely close, so that cost minimization tries to place them in the same section

22 Energy-minimizing Graph Cuts
Theorem: the V_pq must obey the constraint
- Separation cost ≥ Merge cost
- Consequently, we cannot use V_pq to push two nodes into different sections; we use D_p instead

23 Energy-minimizing Graph Cuts
To separate nodes p and q using the node-to-label distances:
- Ensure that either D_p(α) or D_q(α) is large, for every label α
- Then assigning both p and q the same label is too costly

24 Energy-minimizing Graph Cuts
In summary:
- The invisible label lets us exploit the parent-child DOM tree structure
- Nodes with very different content or visual features are split up
- Nodes with very similar content or visual features are merged

25 Proposed Work
A graph-based approach:
- What is a good cost function? It should be intuitive, and have polynomial-time algorithms that get provably close to the optimum. Two candidates: Correlation Clustering and Energy-minimizing Graph Cuts.
- How should we set the weights in the neighborhood graph?
[Figure: the example DOM tree and its neighborhood graph]

26 Learning graph weights
Extract content and visual features from the training data.
Learning V_pq(.):
- Train a logistic regression classifier that outputs the probability that p and q belong to the same section
[Figure: the neighborhood graph with weights V_AB, V_AE, V_BC and D_A, D_B, D_E]
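A minimal sketch of the V_pq learning step, using scikit-learn. The three pairwise features and the synthetic training labels are placeholders for the paper's real features (which come from manually sectioned pages), so the names and data here are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pairwise features for DOM node pairs, e.g. DOM-tree distance,
# rendered distance, content-type match. Synthesized for the demo.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
# Hypothetical labels: 1 if the node pair fell in the same manually marked
# section, else 0 (here generated from a simple rule, for the demo only).
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

clf = LogisticRegression().fit(X_train, y_train)

def v_pq(pair_features):
    """Edge weight = learned probability that the two nodes share a section."""
    return clf.predict_proba(np.asarray(pair_features).reshape(1, -1))[0, 1]
```

In the full pipeline, these probabilities become the edge weights of the neighborhood graph.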

27 Learning graph weights
Extract content and visual features from the training data.
Learning D_p(.):
- The training data does not provide labels, so let the set of labels be the set of DOM tree nodes of that webpage
- D_p(α) = distance between p and α in some feature space
- Learn a Mahalanobis distance metric between nodes, so that distances within a section are smaller than distances across sections
[Figure: the neighborhood graph with weights V_AB, V_AE, V_BC and D_A, D_B, D_E]
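The Mahalanobis distance itself is a small formula; a sketch is below. Fitting the matrix M so that within-section pairs are closer than across-section pairs is the metric-learning step, which is omitted here.

```python
import numpy as np

def mahalanobis(u, v, M):
    """Mahalanobis distance d(u, v) = sqrt((u - v)^T M (u - v)),
    where M is a learned positive semi-definite matrix."""
    d = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(d @ M @ d))

# With M = I this reduces to the Euclidean distance; the learned M
# stretches the feature space so that same-section nodes look close.
```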

28 Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

29 Experiments
Manually sectioned 105 randomly chosen webpages to get 1088 sections.
Two measures were used:
- Adjusted RAND: the fraction of leaf-node pairs correctly predicted to be together or apart (over and above random sectioning)
- Normalized Mutual Information
Both are between 0 and 1, with higher values indicating better results.
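Both measures are available in scikit-learn; `adjusted_rand_score` is the standard chance-corrected Adjusted Rand index matching the slide's description. The toy section assignments below are hypothetical.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical section ids for six leaf nodes: the manual ("gold")
# sectioning versus a predicted one.
gold = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]   # same grouping, different section names

ari = adjusted_rand_score(gold, pred)
nmi = normalized_mutual_info_score(gold, pred)
# Both scores are invariant to relabeling, so a perfect grouping scores 1.0.
```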

30 Experiments
[Plot: distribution of per-webpage Adjusted RAND scores; axes: score vs. % of webpages below that score]
- CCLUS: only 20% of the webpages score better than 0.6
- GCUTS: almost 50% of the webpages score better than 0.6

31 Experiments
[Plot: aggregate scores over all webpages]
GCUTS is better than CCLUS.

32 Experiments
Application to duplicate detection on the Web:
- Collected lyrics of the same songs from 3 different sites (~2300 webpages): nearly identical content, but different template structures
- Our approach: section all webpages, then perform duplicate detection using only the largest section (the primary content)
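A standard way to run the duplicate-detection step on the largest section's text is word shingling plus Jaccard similarity; the slides do not specify the detector, so this sketch, including the shingle length `k`, is our assumption.

```python
def shingles(text, k=4):
    """Set of k-word shingles of a text (a common near-duplicate signature)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0
```

Two pages would be flagged as duplicates when the Jaccard similarity of their largest-section shingle sets exceeds some threshold (the threshold choice is also an assumption).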

33 Experiments
Results: sectioning beats no sectioning, and GCUTS beats CCLUS.

34 Outline
- Motivation and Related Work
- Proposed Work
- Experiments
- Conclusions

35 Conclusions
- Combined visual, DOM, and content-based cues
- Optimization on a neighborhood graph, with node and edge weights learned from training data
- Developed CCLUS and GCUTS, both with approximation guarantees and automatic determination of the number of sections

36 Learning graph weights
Extract content and visual features from training data.
[Figure: the neighborhood graph with weights V_AB, V_AE, V_BC and D_A, D_B, D_E]

37 Energy-minimizing Graph Cuts
What is such a D_p(.) function?
- Use the set of internal DOM nodes as the set of labels
- D_p(α) measures the difference in feature vectors between node p and internal node (label) α
- If nodes p and q are very different, D_p(α) and D_q(α) will differ for all α

38 Correlation Clustering
CCLUS does not enforce the Rendering Constraint:
- Each pixel on the screen must belong to at most one section
- Parent nodes should have the same section as their children
Workaround: consider only leaf nodes in the neighborhood graph.
- But content cues may be too noisy at the leaf level

39 Correlation Clustering
CCLUS does not enforce the Rendering Constraint:
- Each pixel on the screen must belong to at most one section
- Parent section = child section
- Apply the rule only for ancestors "aimed" at visual rendering
[Figure: parent A with children B and C; either S_A = S_B = S_C, or S_A ≠ S_B and S_A ≠ S_C]

40 Correlation Clustering
CCLUS does not enforce the Rendering Constraint.
Workaround: consider only leaf nodes in the neighborhood graph.
- But content cues may be too noisy at the leaf level
[Figure: parent A with children B (S_B = 5) and C (S_C = 7); S_A = ?]

41 Energy-minimizing Graph Cuts
How can we fit the Rendering Constraint?
- Introduce a special "invisible" label ξ
- A parent is invisible unless all its children have the same label
- The V_pq values can be set accordingly
- This automatically infers "rendering" versus "structural" DOM nodes
[Figure: if S_B = S_C = 5 then S_A = 5; if S_B = 5 and S_C = 7 then S_A = ξ]

42 Energy-minimizing Graph Cuts
What is the set of labels?
- The set of internal DOM nodes, available at the start of the algorithm
- The labels are themselves nodes, with feature vectors
- D_p(α) = distance in some feature space, "tuned" to the current webpage