Graph-based Text Classification: Learn from Your Neighbors. Ralitsa Angelova, Gerhard Weikum (Max Planck Institute for Informatics, Stuhlsatzenhausweg, Saarbrücken, Germany). Presented by Chia-Hao Lee
2 Outline: Introduction, Graph-based Classification, Incorporating Metric Label Distances, Experiments, Conclusion
3 Introduction Automatic classification is a supervised learning technique for assigning thematic categories to data items such as customer records, gene-expression data records, Web pages, or text documents. The standard approach is to represent each data item by a feature vector and to learn the parameters of a mathematical decision model. Context-free: the decision is based only on the feature vector of a given data item, disregarding all other data items in the test set.
4 Introduction In many settings, this “context-free” approach does not exploit the available information about relationships between data items. Using the relationship information, we can construct a graph G in which each data item is a node and each relationship instance forms an edge between the corresponding nodes. In the following we will mostly focus on text documents with links to and from other documents.
5 Introduction A straightforward approach to capturing a document’s neighbors would be to incorporate the features and feature weights of the neighbors into the feature vector of the given document itself. A more advanced approach is to model the mutual influence between neighboring documents, aiming to estimate the class labels of all test documents simultaneously.
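As a concrete illustration of the straightforward approach, here is a minimal numpy sketch of this neighbor-feature propagation; the function name and the down-weighting factor alpha are hypothetical choices for this sketch, not taken from the paper:

```python
import numpy as np

def augment_with_neighbors(features, adjacency, alpha=0.5):
    """Concatenate each document's own feature vector with a
    down-weighted average of its neighbors' feature vectors.

    features : (n_docs, n_terms) matrix of local feature weights
    adjacency: (n_docs, n_docs) 0/1 link matrix
    alpha    : hypothetical down-weighting of neighborhood evidence
    """
    deg = adjacency.sum(axis=1, keepdims=True)
    # Average the neighbors' features (isolated nodes get zeros).
    neighbor_mean = (adjacency / np.maximum(deg, 1)) @ features
    return np.hstack([features, alpha * neighbor_mean])
```

The augmented vectors can then be fed to any standard context-free classifier; the more advanced approach of the next slides instead models the mutual influence between labels directly.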
6 Introduction A simple example for RL (relaxation labeling) is shown in Figure 1. Let our set of classes be given. We wish to assign to every document marked “?” its most probable label. Let the contingency matrix in Figure 1b) be estimated from the training data.
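One plausible way to estimate such a neighbor-label contingency matrix from a labeled training graph, sketched in numpy (the edge-list representation and function name are illustrative, not from the paper):

```python
import numpy as np

def neighbor_label_contingency(labels, edges, n_classes):
    """Estimate P[label(v) = j | label(u) = i] by counting how often
    class i links to class j in the labeled training graph.

    labels   : numpy array of class indices for the training documents
    edges    : iterable of (u, v) document-index pairs (undirected links)
    n_classes: number of classes
    """
    counts = np.zeros((n_classes, n_classes))
    for u, v in edges:
        counts[labels[u], labels[v]] += 1
        counts[labels[v], labels[u]] += 1   # undirected: count both ways
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1.0)   # row-normalize to probabilities
```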
7 Introduction The theory paper by Kleinberg and Tardos views the classification problem for nodes in an undirected graph as a metric labeling problem where we aim to optimize a combinatorial function consisting of assignment costs and separation costs.
8 Graph-Based Classification Our approach is based on a probabilistic formulation of the classification problem and uses a relaxation labeling technique to derive two major approaches for finding the maximally likely labeling λ of the given test graph: hard and soft labeling. D: a set of documents. G: a graph whose vertices correspond to documents and whose edges represent the link structure of D. λ(u): the label of node u. τ(d): the feature vector that locally captures the content of document d.
9 Graph-Based Classification Taking into account the underlying link structure and document d's context-based feature vector, the probability of a label c being assigned to d is $P[\lambda(d) = c \mid G, \tau(d)]$. In the spirit of the introduction's discussion on emphasizing the influence of the immediate neighbors of each document, we condition on the neighborhood $N(d)$ only and denote the result by $P[\lambda(d) = c \mid N(d), \tau(d)]$: the label of a node is assumed to be independent of the labels of all other nodes in the graph given the labels of its immediate neighbors. We abbreviate $P[\lambda(d) = c \mid N(d), \tau(d)]$ into $P[c \mid N(d), \tau(d)]$.
10 Graph-Based Classification We abbreviate $P[\lambda(d) = c \mid \tau(d)]$, the graph-unaware probability based only on d's local content, by $P[c \mid d]$. Under the additional independence assumption that there is no direct coupling between the content of a document and the labels of its neighbors, the following central equation holds for the total probability, summing up the posterior probabilities over all possible labelings $\lambda_N$ of the neighborhood: $P[c \mid N(d), \tau(d)] = \sum_{\lambda_N} P[c \mid \lambda_N, \tau(d)] \cdot P[\lambda_N]$.
11 Graph-Based Classification In the same vein, if we further assume independence among all neighbor labels of the same node, we reach the following formulation of our neighborhood-conscious classification problem: $P[c \mid N(d), \tau(d)] = \sum_{\lambda_N} P[c \mid \lambda_N, \tau(d)] \cdot \prod_{u \in N(d)} P[\lambda_N(u)]$. This can be computed in an iterative manner, with the estimates of iteration r feeding iteration r+1: $P^{(r+1)}[c \mid N(d), \tau(d)] = \sum_{\lambda_N} P[c \mid \lambda_N, \tau(d)] \cdot \prod_{u \in N(d)} P^{(r)}[\lambda_N(u)]$.
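A runnable sketch of one way to implement this iteration, assuming the neighbor influence is summarized by the class-contingency matrix estimated earlier; the matrix-product form and the smoothing constants are choices of this sketch, not prescribed by the paper:

```python
import numpy as np

def relaxation_labeling(local_probs, adjacency, coupling, n_iter=10):
    """Iterative soft-labeling sketch.

    local_probs: (n_docs, n_classes) context-free P[c | d] from any
                 base text classifier
    adjacency  : (n_docs, n_docs) 0/1 symmetric link matrix of the test graph
    coupling   : (n_classes, n_classes) neighbor-label contingency matrix,
                 rows normalized to probabilities
    """
    probs = local_probs.copy()
    for _ in range(n_iter):
        # messages[u, c] = sum_{c'} coupling[c, c'] * P^(r)[c' | u]:
        # how strongly neighbor u, under its current label beliefs,
        # supports label c for an adjacent document.
        messages = probs @ coupling.T
        log_msg = np.log(np.maximum(messages, 1e-12))
        # Product over neighbors, done in log space: row d of
        # (adjacency @ log_msg) sums the messages of d's neighbors.
        new_log = np.log(local_probs + 1e-12) + adjacency @ log_msg
        new_log -= new_log.max(axis=1, keepdims=True)   # stabilize exp
        probs = np.exp(new_log)
        probs /= probs.sum(axis=1, keepdims=True)       # renormalize rows
    return probs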
12 Graph-Based Classification Hard labeling: In contrast to the soft labeling approach, we also consider a method that takes only the most probable label assignments in the test document's neighborhood to be significant for the computation. Let $\lambda^*(u) = \arg\max_c P[\lambda(u) = c]$ be the most probable label of node u; the sum over all neighborhood labelings then collapses onto the single labeling $\lambda^*(N(d))$: $P^{(r+1)}[c \mid N(d), \tau(d)] = P[c \mid \lambda^*(N(d)), \tau(d)]$.
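The hard variant can reuse the same machinery, with each neighbor's belief rounded to a one-hot vector before the update; again a sketch under the same assumptions as above:

```python
import numpy as np

def hard_labeling_step(local_probs, adjacency, coupling, probs):
    """One hard-labeling iteration: each neighbor contributes only its
    currently most probable label lambda*(u), one-hot encoded, so the
    soft messages collapse onto a single neighborhood labeling."""
    n_docs, _ = probs.shape
    best = probs.argmax(axis=1)               # lambda*(u) for every node
    hard = np.zeros_like(probs)
    hard[np.arange(n_docs), best] = 1.0       # one-hot "rounded" beliefs
    messages = hard @ coupling.T              # row u equals coupling[:, best[u]]
    log_msg = np.log(np.maximum(messages, 1e-12))
    new_log = np.log(local_probs + 1e-12) + adjacency @ log_msg
    new_log -= new_log.max(axis=1, keepdims=True)
    new = np.exp(new_log)
    return new / new.sum(axis=1, keepdims=True)
```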
13 Graph-Based Classification Soft labeling: The soft labeling approach aims at better classification accuracy by avoiding the overly eager “rounding” that the hard labeling approach performs.
14 Incorporating Metric Label Distances Intuitively, neighboring documents should receive similar class labels. For example, suppose we have a set of classes {C, E, S}, where C and E are thematically close to each other, and we wish to find the most probable label for a test document d. A document discussing scientific problems (S) would be much farther away from both C and E. So, a similarity metric imposed on the set of labels would have a high value for the pair (C, E) and small values for the class pairs (C, S) and (E, S).
15 Incorporating Metric Label Distances This is why introducing a metric should help improve the classification result: in such a metric, similar classes are separated by a shorter distance and impose a smaller separation cost on an edge labeling. Our approach, on the other hand, is general, and we construct the metric Γ automatically from the training data. We incorporate the label metric into the iterations for computing the probability of an edge labeling by treating $\Gamma(\lambda(u), \lambda(v))$ as a scaling factor.
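Since this slide does not spell out the paper's exact construction of Γ, here is one plausible data-driven construction as a purely illustrative sketch: cosine similarity between class centroids of the training documents (assumes numpy arrays and that every class occurs in the training set):

```python
import numpy as np

def label_similarity_from_centroids(features, labels, n_classes):
    """Derive a label-similarity matrix Gamma from the training data:
    Gamma[i, j] is the cosine similarity between the centroids of
    classes i and j (high for similar classes, low for distant ones)."""
    centroids = np.vstack([features[labels == c].mean(axis=0)
                           for c in range(n_classes)])
    norms = np.linalg.norm(centroids, axis=1, keepdims=True)
    unit = centroids / np.maximum(norms, 1e-12)
    return unit @ unit.T
```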
16 Incorporating Metric Label Distances This way, we magnify the impact of edges between nodes with similar labels and scale down the impact of edges between dissimilar ones: $P^{(r+1)}[c \mid N(d), \tau(d)] \propto \sum_{\lambda_N} P[c \mid \lambda_N, \tau(d)] \cdot \prod_{u \in N(d)} \Gamma(c, \lambda_N(u)) \cdot P^{(r)}[\lambda_N(u)]$.
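A sketch of how Γ could act as such a scaling factor in practice: fold it elementwise into the coupling matrix and re-normalize, then run the same relaxation iterations as before (hypothetical placement; the paper may apply Γ at a different point in the formula):

```python
import numpy as np

def scale_coupling_by_metric(coupling, gamma):
    """Fold the label-similarity matrix Gamma elementwise into the
    neighbor-coupling matrix and re-normalize its rows, so the
    relaxation update weights similar-label edges more heavily."""
    scaled = coupling * gamma
    row = scaled.sum(axis=1, keepdims=True)
    return scaled / np.maximum(row, 1e-12)
```

The rescaled matrix then simply replaces `coupling` in the `relaxation_labeling` sketch of the earlier slide.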
17 Experiments We have tested our graph-based classifier on three different data sets. The first one includes approximately 16,000 scientific publications chosen from the DBLP database. The second dataset has been selected from the Internet Movie Database (IMDB). The third dataset used in the experiments was drawn from the online encyclopedia Wikipedia.
18–22 Experiments [slides with result figures and tables; content not recoverable]
23 Conclusion The presented GC method for graph-based classification is a way of exploiting the context relationships of data items. Incorporating metric distances among different labels contributed to the very good performance of the GC method. This is a new form of exploiting knowledge about the relationships among category labels, and thus about the structure of the classifier's target space.