Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering. Hongyuan Zha, Department of Computer Science & Engineering, Pennsylvania State University, University Park, PA. SIGIR ’02.
Introduction Informally, the goal of text summarization is to take a textual document, extract content from it, and present the important content to the user in a condensed form and in a manner sensitive to the user’s or application’s needs. There are two basic approaches to sentence extraction. Supervised approaches need human-generated summary extracts for feature extraction and parameter estimation: sentence classifiers are trained using human-generated sentence-summary pairs as training examples. We adopt the unsupervised approach, explicitly modeling both keyphrases and the sentences that contain them using weighted undirected and weighted bipartite graphs.
Introduction First cluster the sentences of a (set of) document(s) into topical groups, and then select the keyphrases and sentences by their saliency scores within each group. The major contributions are: proposing the use of sentence link priors resulting from the linear order of sentences to enhance sentence clustering quality, and developing the mutual reinforcement principle for simultaneous keyphrase and sentence saliency score computation.
The Mutual Reinforcement Principle For each document we generate two sets of objects: the set of terms T = {t_1,…,t_n} and the set of sentences S = {s_1,…,s_m}. Build a weighted bipartite graph: if term t_i appears in sentence s_j, create an edge between them with nonnegative weight w_ij. We can simply choose w_ij to be the number of times t_i appears in s_j; more sophisticated weighting schemes are discussed later. Hence G(T, S, W) is a weighted bipartite graph, where W = [w_ij] is the n-by-m weight matrix, and we wish to compute saliency scores u(t_i) for the terms and v(s_j) for the sentences.
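A minimal sketch of this construction in Python (the helper name term_sentence_matrix is ours; raw counts only, with the paper's stop-word removal and stemming omitted):

```python
import re
from collections import Counter

import numpy as np

def term_sentence_matrix(sentences):
    """Build the n-by-m term-sentence weight matrix W, where
    w_ij counts how often term t_i appears in sentence s_j."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    terms = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(terms)}
    W = np.zeros((len(terms), len(sentences)))
    for j, toks in enumerate(tokenized):
        for t, c in Counter(toks).items():
            W[index[t], j] = c
    return terms, W
```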
The Mutual Reinforcement Principle Mutual reinforcement principle: the saliency score of a term is determined by the saliency scores of the sentences it appears in, and the saliency score of a sentence is determined by the saliency scores of the terms it contains. Mathematically, u(t_i) ∝ Σ_{j: t_i ∈ s_j} w_ij v(s_j) and v(s_j) ∝ Σ_{i: t_i ∈ s_j} w_ij u(t_i).
The Mutual Reinforcement Principle Written in matrix form, σu = Wv and σv = W^T u, so u and v are the left and right singular vectors of W corresponding to the largest singular value σ. We can rank terms and sentences in decreasing order of their saliency scores and choose the top few terms or sentences. Choose the initial value of v to be the vector of all ones, and alternate between the following two steps until convergence: 1. Compute u = Wv and normalize u. 2. Compute v = W^T u and normalize v. Upon convergence, σ can be computed as σ = u^T W v.
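This alternating iteration is the power method applied to the largest singular triplet of W; a minimal numpy sketch (the function name saliency_scores is ours):

```python
import numpy as np

def saliency_scores(W, tol=1e-10, max_iter=1000):
    """Alternate u <- W v and v <- W^T u with normalization,
    i.e. the power method for the largest singular triplet of W.
    Returns term scores u, sentence scores v, and sigma."""
    v = np.ones(W.shape[1])
    for _ in range(max_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v_new = W.T @ u
        v_new /= np.linalg.norm(v_new)
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    sigma = u @ W @ v          # sigma = u^T W v at convergence
    return u, v, sigma
```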
The Mutual Reinforcement Principle The above weighted bipartite graph can be extended by adding vertex weights to the sentences and/or the terms. For example, the weight of a sentence vertex can be increased if it contains certain bonus words. In general, let D_T and D_S be two diagonal matrices whose diagonal elements represent the weights of the terms and sentences, respectively; we then compute the largest singular value triplet {u, σ, v} of the scaled matrix D_T W D_S.
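With the sketch above, vertex weighting is a two-line change. Here has_bonus_word is a hypothetical predicate and the boost of 2.0 is an illustrative value, not from the paper:

```python
# Hypothetical vertex weights: boost sentences containing bonus words.
d_T = np.ones(W.shape[0])                    # uniform term weights
d_S = np.array([2.0 if has_bonus_word(s) else 1.0 for s in sentences])
u, v, sigma = saliency_scores(np.diag(d_T) @ W @ np.diag(d_S))
```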
Clustering Sentences into Topical Groups The saliency score computation is more effective if it is applied within each topical group of a document. For sentence clustering we build an undirected weighted graph with vertices representing sentences; two sentences s_i and s_j are linked if they share terms, with weight w_ij indicating the similarity between s_i and s_j (there are many different ways to specify it). Sentences are arranged in linear order, and near-by sentences tend to be about the same topic. That topical groups are usually made up of sections of consecutive sentences is a strong prior, which we call the sentence link prior.
Incorporating Sentence Link Priors We call s_i and s_j near-by if s_j immediately follows s_i. A simple approach to take advantage of the sentence link prior is to modify the weights: set ŵ_ij = w_ij + α if s_i and s_j are near-by, and ŵ_ij = w_ij otherwise, giving the modified similarity matrix W_S(α). We call α the sentence link prior strength, and use the idea of generalized cross-validation (GCV) to choose α. Note that incorporating the sentence link prior is different from text segmentation: we allow several sections of consecutive sentences to form a single topical group.
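A minimal sketch of the weight modification, continuing the numpy code above:

```python
def add_link_prior(W_S, alpha):
    """Return W_S(alpha): boost the similarity of adjacent sentences
    (|i - j| == 1) by alpha, leaving all other weights unchanged."""
    boosted = W_S.copy()
    idx = np.arange(W_S.shape[0] - 1)
    boosted[idx, idx + 1] += alpha
    boosted[idx + 1, idx] += alpha
    return boosted
```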
Incorporating Sentence Link Priors For fixed α, apply the spectral clustering technique to W_S(α) to obtain a partition Π*(α). Define γ(Π) to be the number of consecutive sentence segments a partition Π generates; following the GCV idea, we compute a score of α that trades clustering quality off against the fragmentation measured by γ(Π*(α)). We then select the α that maximizes this score as the estimated optimal α value.
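The segment count γ(Π) is easy to compute from cluster labels in document order; a sketch of the selection loop, with the GCV-style criterion left as a supplied `score` function since its exact form is not reproduced here (`add_link_prior` is from the sketch above):

```python
def gamma(labels):
    """gamma(Pi): the number of maximal runs of consecutive
    sentences sharing a cluster label, in document order."""
    return 1 + sum(a != b for a, b in zip(labels, labels[1:]))

def choose_alpha(W_S, alphas, cluster, score):
    """Cluster W_S(alpha) for each candidate alpha; return the alpha
    whose partition maximizes the supplied GCV-style score."""
    return max(alphas, key=lambda a: score(cluster(add_link_prior(W_S, a))))
```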
Sum-of-Squares Cost Function and Spectral Relaxation In the bipartite graph G(T, S, W), each sentence is represented by a column of W = [w_1,…,w_m], which we call a sentence vector. A partition Π into k clusters can be written as W E = [W_1,…,W_k], where E is a permutation matrix and W_i is n-by-m_i, with m_i the number of sentences in cluster i. For a given partition, the sum-of-squares cost function is ss(Π) = Σ_{i=1}^k Σ_{w ∈ W_i} ||w − c_i||², where c_i is the centroid of the sentence vectors in cluster i.
Sum-of-Squares Cost Function and Spectral Relaxation The traditional K-means algorithm is iterative, and in each iteration the following is performed: for each sentence vector w, find the centroid c_i closest to it and assign w to cluster i; then compute a new set of centroids. Its major drawback is that it is prone to local minima, some of which give rise to clusters containing very few data points. An equivalent formulation can be derived as a matrix trace maximization problem; this also makes the K-means method easily adaptable to utilizing the sentence link priors.
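For reference, a plain K-means sketch over the sentence vectors (no empty-cluster handling beyond keeping the old centroid; a sketch, not the paper's implementation):

```python
import numpy as np

def kmeans(W, k, n_iter=100, seed=0):
    """Plain K-means on the columns of W (the sentence vectors).
    Returns a cluster label for each of the m sentences."""
    rng = np.random.default_rng(seed)
    X = W.T                                   # m sentence vectors as rows
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)             # assign to nearest centroid
        new = np.array([X[labels == i].mean(axis=0)
                        if np.any(labels == i) else centers[i]
                        for i in range(k)])   # recompute centroids
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```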
Sum-of-Squares Cost Function and Spectral Relaxation Let e be a vector of appropriate dimension with all elements equal to one; thus the centroid of cluster i is c_i = W_i e / m_i. The sum-of-squares cost function can then be written as ss(Π) = Σ_{i=1}^k ||W_i − c_i e^T||_F² = ||W||_F² − trace(X^T (W^T W) X), where X is the orthonormal cluster indicator matrix whose i-th column has entries 1/√m_i on the sentences of cluster i and zeros elsewhere. Minimizing ss(Π) is therefore equivalent to maximizing trace(X^T (W^T W) X) over indicator matrices. Letting X instead be an arbitrary orthonormal matrix, we obtain the relaxed matrix trace maximization problem max_{X^T X = I_k} trace(X^T (W^T W) X).
Sum-of-Squares Cost Function and Spectral Relaxation An extension of the Rayleigh-Ritz characterization of eigenvalues of symmetric matrices shows that the above maximum is achieved by the eigenvectors corresponding to the k largest eigenvalues of the Gram matrix W^T W. We also have the inequality min_Π ss(Π) ≥ ||W||_F² − Σ_{i=1}^k λ_i(W^T W). This gives a lower bound for the minimum of the sum-of-squares cost function. In particular, we can replace W^T W by W_S(α) after incorporating the link strength.
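A quick numerical check of this bound under the definitions above (helper names are ours; `kmeans` is from the earlier sketch):

```python
import numpy as np

def sos_cost(W, labels, k):
    """Sum-of-squares cost ss(Pi) of a partition of the columns of W."""
    X = W.T
    return sum(((X[labels == i] - X[labels == i].mean(axis=0)) ** 2).sum()
               for i in range(k) if np.any(labels == i))

def spectral_lower_bound(W, k):
    """||W||_F^2 minus the sum of the k largest eigenvalues of W^T W."""
    eigvals = np.linalg.eigvalsh(W.T @ W)     # ascending order
    return (W ** 2).sum() - eigvals[-k:].sum()

# For every partition Pi: sos_cost(W, labels, k) >= spectral_lower_bound(W, k)
```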
Sum-of-Squares Cost Function and Spectral Relaxation The cluster label assignment is done by QR decomposition with pivoting: 1. Compute the k eigenvectors V_k = [v_1,…,v_k] of W_S(α) corresponding to the k largest eigenvalues. 2. Compute the pivoted QR decomposition V_k^T P = Q [R_11, R_12], where Q is a k-by-k orthogonal matrix, R_11 is a k-by-k upper triangular matrix, and P is a permutation matrix. 3. Compute R̂ = R_11^{-1} [R_11, R_12] P^T = [I_k, R_11^{-1} R_12] P^T. The cluster label of each sentence is then determined by the row index of the largest element in absolute value of the corresponding column of R̂.
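A sketch of this assignment step using scipy's pivoted QR:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def assign_clusters(W_S_alpha, k):
    """Spectral clustering with cluster assignment by pivoted QR.
    Returns a label in {0, ..., k-1} for each sentence."""
    eigvals, eigvecs = np.linalg.eigh(W_S_alpha)
    Vk = eigvecs[:, -k:]                      # k largest eigenvectors
    Q, R, piv = qr(Vk.T, pivoting=True)       # Vk.T[:, piv] = Q R
    R_hat_p = solve_triangular(R[:, :k], R)   # [I_k, R11^{-1} R12]
    R_hat = np.empty_like(R_hat_p)
    R_hat[:, piv] = R_hat_p                   # undo the permutation P
    return np.abs(R_hat).argmax(axis=0)       # row index of largest |entry|
```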
Experimental Results Evaluation is a challenging task: human-generated summaries tend to differ from one another; another approach is to evaluate performance extrinsically on, for example, document retrieval or text categorization. We collected 10 documents and manually divided each into topical groups. Notice that the clustering is not unique: some clusters can be merged into a bigger cluster, and some can be split into finer structures.
In processing the documents, we delete stop words and apply Porter’s stemming. Construct W_S = (w_ij): each sentence is represented by a column of W, and w_ij is equal to the dot-product of the sentence vectors for s_i and s_j. The sentence vectors are weighted with tf.idf weighting and normalized to have Euclidean length one, so w_ij is the cosine similarity of the two sentences. To measure the quality of clustering, we assume the manually generated section number is the true cluster label; here we use a greedy algorithm to compute a sub-optimal matching between cluster labels and section labels.
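One way to reproduce this pipeline with standard libraries, as a stand-in for the paper's preprocessing (scikit-learn's TfidfVectorizer and NLTK's PorterStemmer; helper names are ours):

```python
import numpy as np
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

_stem = PorterStemmer().stem

def _tokens(sentence):
    """Lowercase, drop stop words, and apply Porter stemming."""
    return [_stem(t) for t in sentence.lower().split()
            if t not in ENGLISH_STOP_WORDS]

def sentence_similarity_matrix(sentences):
    """tf.idf-weight each sentence vector, normalize it to unit Euclidean
    length, and return W_S = (w_ij), the matrix of pairwise dot products."""
    X = TfidfVectorizer(tokenizer=_tokens, norm="l2").fit_transform(sentences)
    return (X @ X.T).toarray()                # cosine similarities
```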
Experimental Results For a sequence of α values, apply the spectral clustering algorithm to the weight matrix W_S(α) of the document dna. We also plot the clustering accuracy against α, contrasting the clustering results with and without sentence link priors. The clustering algorithm matches the section structure poorly when there is no near-by sentence constraint (i.e., α = 0); with too large an α, sentence similarities are overwhelmed by link strength, and the results are also poor.
Experimental Results [Figure: clustering accuracy plotted against the sentence link prior α for the document dna, with and without sentence link priors.]
The GCV method is quite efficient at choosing a good α: in Table 1, the estimated α may differ from the optimal α, but it still produces accuracy that matches the best values well.
Experimental Results For the computation of keyphrase and sentence saliency scores, we apply sentence weights (the diagonal matrix D_S) when applying the mutual reinforcement principle: the i-th sentence receives a weight inversely proportional to its length. The idea is to mitigate the influence of long sentences by scaling by a factor inversely proportional to the sentence length; at the same time, sentences close to the beginning of the document get a small boost.
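The exact weighting formula is not reproduced here; a hypothetical instance with the stated properties (down-weight by length, small boost for early sentences) might look like:

```python
def sentence_weights(sentence_lengths):
    """Hypothetical D_S diagonal: inversely proportional to sentence
    length, with a mild boost for sentences early in the document.
    (Illustrates the stated idea only; not the paper's exact formula.)"""
    return np.array([(1.0 / max(length, 1)) * (1.0 + 1.0 / (i + 1))
                     for i, length in enumerate(sentence_lengths)])
```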
Experimental Results We use the document dna for illustration. For α = 3.5, the clustering matches the section structure well except for cluster 8. Its sentences 1 to 4 discuss issues related to the common ancestor “Eve”, and section 4, with the heading Defining mitochondrial ancestors, is about the same topic; here sentence similarities win over sentence link strength. We also applied the mutual reinforcement principle to all sentences and extracted the first few keywords and sentences.
Conclusions We presented a novel method for simultaneous keyphrase extraction and generic text summarization. We explored the sentence link priors embedded in the linear order of a document to enhance clustering quality, and developed the mutual reinforcement principle to compute keyphrase and sentence saliency scores within each topical group. Many issues need further investigation: more research on choosing the optimal α; other possible ways of clustering, for example a two-stage method that 1) segments the sentences and 2) clusters the segments into topical groups; replacing simple terms with noun phrases, which will impact W and W_S; and extension to translingual summarization.