Published by Pierce Turner Modified over 9 years ago
1
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
2
Hypertext categorization Automatic topic identification Also called “supervised learning” Given Hypertext document corpus A “small” set of classified documents Goal Construct a classifier Apply to new documents
3
Example from the web
4
Applications and benefits Retrieval Browsing (Yahoo!) Searching (“socks” and NOT “apparel”) Adopted by most search companies Profile based filtering and routing Email, news, “push” services Collaborative filtering Automatically categorize click trails Cluster users based on frequently visited topics
5
Click-trail and bookmark organizer Integrated browser View of topic Hierarchy Web Page
6
The limitation of text-only classifiers Text-only classifiers are well-researched Rule induction Bayesian learning 87% accurate on news Lower accuracy on hyperlinked corpora Heterogeneous Information in links not utilized
7
Our contributions A novel approach to hypertext classification Combine text and link information Framework for link modeling in hypertext graphs Markov random field (limited “sphere of influence”) Techniques for feature extraction Use of domain knowledge to limit complexity Techniques to handle incomplete information Iterative labeling algorithm
8
Is this a new problem? Reduction to text classification Include (tagged) text from neighbors Classify the result Does not increase accuracy Big neighbor pages Lack of semantic correlation
9
“Big neighbor”
10
More of “big neighbor”
11
Coherent pages linking to incoherent pages
12
Model specification A hypertext graph Nodes = documents Edges = hyperlinks Document = sequence or set of terms and links Each document has a class label Some labels are known Most are unknown Labels are drawn from some distribution
13
Assumptions used in probability model No indirect coupling between the text and the neighbors’ classes The probability of a node’s class depends only on neighbors within limited radius Independence among the neighbor class probabilities Can assume higher order dependence (neighborhood radius greater than 1)
14
Probability estimation Posterior probability of class given text and neighborhood Prior class probability Class conditional term distribution Class conditional neighbor class distribution (independence between neighbors)
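The factors named on this slide can be combined, under the independence assumptions of the previous slide, into a posterior of roughly the following form. The notation is an assumption introduced here for illustration and does not come from the slides: $c_i$ is the class of document $i$, $\tau_i$ its terms, and $N_i$ its set of neighbors. Treating terms as independent given the class is a further naive Bayes assumption.

```latex
\Pr\bigl(c_i \mid \tau_i, \{c_j\}_{j \in N_i}\bigr)
  \;\propto\;
  \underbrace{\Pr(c_i)}_{\text{prior}}
  \;\underbrace{\prod_{t \in \tau_i} \Pr(t \mid c_i)}_{\text{term distribution}}
  \;\underbrace{\prod_{j \in N_i} \Pr(c_j \mid c_i)}_{\text{neighbor class distribution}}
```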
15
Bayesian classification algorithm Learning phase (parameter estimation) Distribution of a text within a class Interclass linkage probabilities Prior probability of a class Classification phase Compute class probabilities Choose the class with highest posterior probability
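The two phases above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the data layout (a list of `(terms, neighbor_classes, label)` triples), and the add-one smoothing are all assumptions made here.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Learning phase. docs: list of (terms, neighbor_classes, label)."""
    prior = Counter()                    # class -> number of documents
    term_counts = defaultdict(Counter)   # class -> term -> count
    link_counts = defaultdict(Counter)   # class -> neighbor class -> count
    vocab = set()
    for terms, nbr_classes, label in docs:
        prior[label] += 1
        term_counts[label].update(terms)
        link_counts[label].update(nbr_classes)
        vocab.update(terms)
    return prior, term_counts, link_counts, vocab

def log_posterior(model, terms, nbr_classes, c):
    """Unnormalized log posterior of class c given text and neighbor classes."""
    prior, term_counts, link_counts, vocab = model
    score = math.log(prior[c] / sum(prior.values()))
    t_total = sum(term_counts[c].values())
    for t in terms:
        # Add-one smoothing (an assumption of this sketch).
        score += math.log((term_counts[c][t] + 1) / (t_total + len(vocab)))
    l_total = sum(link_counts[c].values())
    for k in nbr_classes:
        score += math.log((link_counts[c][k] + 1) / (l_total + len(prior)))
    return score

def classify(model, terms, nbr_classes):
    """Classification phase: choose the class with the highest posterior."""
    return max(model[0], key=lambda c: log_posterior(model, terms, nbr_classes, c))

# Toy usage with made-up data.
docs = [
    (["paint", "canvas"], ["arts"], "arts"),
    (["brush", "paint"], ["arts", "arts"], "arts"),
    (["guitar", "chord"], ["music"], "music"),
    (["chord", "melody"], ["music", "arts"], "music"),
]
model = train(docs)
print(classify(model, ["paint"], ["arts"]))
```

Passing an empty term list makes the decision depend on the prior and the neighbor classes only, which is one way to see the contribution of the link model in isolation.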
16
Partial neighborhood knowledge Problem: Class of test page depends on neighbors’ classes Must know neighbors’ classes to use interclass probabilities: circularity! Solution: Iterative labeling Initially classify neighboring nodes using text Repeatedly reclassify until consistent Text, link, or joint model Will this stabilize?
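A minimal sketch of the iterative labeling loop described above, under several assumptions made here for illustration: hard (single-label) assignments instead of class probabilities, synchronous updates, precomputed text-only log scores, and a cap on the number of rounds because stabilization is not guaranteed. The function and parameter names are hypothetical.

```python
import math

def iterative_labeling(graph, known, text_logp, link_logp, max_rounds=20):
    """
    graph:     dict node -> list of neighbor nodes (all neighbors are keys of graph)
    known:     dict node -> fixed class label
    text_logp: dict unlabeled node -> dict class -> text-only log score
    link_logp: dict (class, neighbor_class) -> log Pr(neighbor_class | class)
    """
    # Step 1: bootstrap every unlabeled node from its text alone.
    labels = dict(known)
    for node in graph:
        if node not in known:
            labels[node] = max(text_logp[node], key=text_logp[node].get)

    # Step 2: reclassify with text plus current neighbor labels until nothing changes.
    for _ in range(max_rounds):
        new_labels = dict(labels)
        for node in graph:
            if node in known:
                continue  # known labels are never overwritten

            def score(c, node=node):
                link_term = sum(link_logp[(c, labels[n])] for n in graph[node])
                return text_logp[node][c] + link_term

            new_labels[node] = max(text_logp[node], key=score)
        if new_labels == labels:
            break  # consistent labeling reached
        labels = new_labels
    return labels

# Toy usage: node "b" looks like music from text, but both neighbors are arts.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
known = {"a": "arts", "c": "arts"}
text_logp = {"b": {"arts": math.log(0.4), "music": math.log(0.6)}}
link_logp = {
    ("arts", "arts"): math.log(0.9), ("arts", "music"): math.log(0.1),
    ("music", "music"): math.log(0.9), ("music", "arts"): math.log(0.1),
}
result = iterative_labeling(graph, known, text_logp, link_logp)
print(result["b"])
```

Synchronous hard updates can oscillate on some graphs, which is why the loop stops after `max_rounds` even if the labels have not settled.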
17
Data set 1: US patent database Local text information Title Abstract Citation links Related patents cite each other Complete knowledge of the neighbors’ classes
18
Complete knowledge of neighborhood Features used: Local text Class tags from neighbor links Large gain from tags Gains sensitive to tag representation: /Arts /Arts/Painting
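The two tag forms on this slide, `/Arts` and `/Arts/Painting`, differ in how much of the topic path they keep. One possible way to expose both granularities as features is to emit every prefix of a neighbor's class path. This is an assumption made here for illustration, and `path_prefixes` is a hypothetical helper, not something the slides define.

```python
def path_prefixes(class_path):
    """Return every prefix of a topic path, from coarsest to finest."""
    parts = [p for p in class_path.split("/") if p]
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]

print(path_prefixes("/Arts/Painting"))
```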
19
Partial knowledge of neighborhood Algorithm: Grow radius-two neighborhood Delete labels from a fraction of nodes Do iterative labeling Observations: Benefit from links Text+Link most robust
20
Data set 2: Yahoo! Few links point to classified documents 19% of docs have any classified out-link 28% have any classified in-link 40% have either one Need to find new source of information and extend the algorithm
21
Radius-2 information: co-citations [Figure: document to be classified, bridge, classified documents, and unclassified documents, connected by I-links and O-links; path types IO, OI, II/OO] An “IO-bridge” connects to many pages of similar topics “OI” tends to be noisy (many topics point to Netscape and Free Speech Online) “II” and “OO” lead to topic divergence
22
Link proximity [Figure: a bridge page with out-links Link #1, …, Link #i-1, Link #i, Link #i+1, …; labels: Document to be classified, Art, Music, Unknown] Are out-links that are close together more likely to point to related topics than out-links that are far apart?
23
Bridges are locally coherent Link proximity implies semantic proximity Exploit this source of information Huge attribute space Simple classification Check coherence Voting
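A minimal sketch of voting over bridges with a proximity window. It assumes that each bridge is given as the ordered list of classes of its out-links (`None` for unclassified targets) together with the position of the link to the document being classified. The window size, the function names, and the omission of the coherence check are simplifications made here, not details from the slides.

```python
from collections import Counter

def vote_from_bridge(bridge_links, target_index, window=2):
    """Count classes of classified out-links within `window` positions of the target link."""
    votes = Counter()
    lo = max(0, target_index - window)
    hi = min(len(bridge_links), target_index + window + 1)
    for i in range(lo, hi):
        if i != target_index and bridge_links[i] is not None:
            votes[bridge_links[i]] += 1
    return votes

def classify_by_bridges(bridges, window=2):
    """bridges: list of (bridge_links, target_index). Returns the majority class or None."""
    total = Counter()
    for links, idx in bridges:
        total.update(vote_from_bridge(links, idx, window))
    return total.most_common(1)[0][0] if total else None

# Toy usage: the target link is the None entry at the given index of each bridge.
bridges = [
    (["Art", "Art", None, "Music", "Music", "Music"], 2),
    (["Art", None, "Art"], 1),
]
print(classify_by_bridges(bridges, window=1))
```

A small window uses only nearby links, which reflects the locality idea on this slide. A window as large as the page reduces to plain voting over all classified out-links of the bridge.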
24
Effect of exploiting bridges and locality
25
Conclusions New model for citation among hyperlinked documents belonging to various topics New categorization algorithm Complexity controlled using domain knowledge about citations Significant increase in accuracy
26
Future work Better models for joint distribution between terms and links Semantic page segmentation to distill “pure” bridges from ones having a mixture of topics Higher complexity Potentially better results More clever use of neighbors’ text Investigation of the relationship between spatial and semantic proximity
27
Related work