Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering
Hongyuan Zha
Department of Computer Science & Engineering, Pennsylvania State University, University Park, PA
SIGIR '02

Introduction
Informally, the goal of text summarization is to take a textual document, extract content from it, and present the important content to the user in a condensed form and in a manner sensitive to the user's or application's needs.
There are two basic approaches to sentence extraction. Supervised approaches need human-generated summary extracts for feature extraction and parameter estimation; sentence classifiers are trained using human-generated sentence-summary pairs as training examples.
We adopt the unsupervised approach, explicitly modeling both the keyphrases and the sentences that contain them using weighted undirected and weighted bipartite graphs.

Introduction
The method first clusters the sentences of a document (or set of documents) into topical groups and then selects keyphrases and sentences by their saliency scores within each group.
The major contributions are:
Proposing the use of sentence link priors, which result from the linear order of the text, to enhance sentence clustering quality.
Developing the mutual reinforcement principle for simultaneous computation of keyphrase and sentence saliency scores.

The Mutual Reinforcement Principle
For each document we generate two sets of objects: the set of terms T = {t_1, …, t_n} and the set of sentences S = {s_1, …, s_m}.
Build a weighted bipartite graph: if term t_i appears in sentence s_j, create an edge between them and specify a nonnegative weight w_ij. We can simply choose w_ij to be the number of times t_i appears in s_j; more sophisticated weighting schemes are discussed later.
Hence G(T, S, W) is a weighted bipartite graph where W = [w_ij] is the n-by-m term-sentence weight matrix, and we wish to compute saliency scores u(t_i) and v(s_j).

The Mutual Reinforcement Principle
Mutual reinforcement principle: the saliency score of a term is determined by the saliency scores of the sentences it appears in, and the saliency score of a sentence is determined by the saliency scores of the terms it contains. Mathematically,
u(t_i) ∝ Σ_j w_ij v(s_j),    v(s_j) ∝ Σ_i w_ij u(t_i).

The Mutual Reinforcement Principle
Written in matrix form, σ u = W v and σ v = W^T u; that is, u and v are the left and right singular vectors of W associated with its largest singular value σ.
We can rank terms and sentences in decreasing order of their saliency scores and choose the top terms or sentences.
Choose the initial value of v to be the vector of all ones and alternate between the following two steps until convergence:
1. Compute u = W v and normalize it.
2. Compute v = W^T u and normalize it.
σ can be computed as u^T W v upon convergence.
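As a concrete sketch, the alternating iteration above can be written in a few lines of NumPy; the function name, tolerance, and iteration cap are illustrative choices rather than values from the paper:

```python
import numpy as np

def saliency_scores(W, tol=1e-8, max_iter=1000):
    """Alternating iteration from the slide: W is the n-by-m term-sentence
    weight matrix; returns term scores u, the largest singular value sigma,
    and sentence scores v."""
    n, m = W.shape
    v = np.ones(m) / np.sqrt(m)          # initial v: all ones, normalized
    for _ in range(max_iter):
        u = W @ v
        u /= np.linalg.norm(u)           # step 1: compute and normalize u
        v_new = W.T @ u
        v_new /= np.linalg.norm(v_new)   # step 2: compute and normalize v
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    sigma = u @ W @ v                    # sigma upon convergence
    return u, sigma, v
```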

The Mutual Reinforcement Principle
The above weighted bipartite graph can be extended by adding vertex weights to the sentences and/or the terms. For example, the weight of a sentence vertex can be increased if it contains certain bonus words.
In general, let D_T and D_S be two diagonal matrices whose diagonal elements represent the term and sentence weights; we then compute the largest singular triplet {u, σ, v} of the scaled matrix D_T W D_S.
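Equivalently, one can form the scaled matrix explicitly and ask an off-the-shelf sparse SVD routine for the top singular triplet; a sketch, where the diagonal weight vectors are placeholders for whatever term/sentence weighting is chosen:

```python
import numpy as np
from scipy.sparse.linalg import svds

def scaled_saliency(W, term_weights, sentence_weights):
    """Largest singular triplet of D_T W D_S (vertex-weighted variant)."""
    A = np.diag(term_weights) @ W @ np.diag(sentence_weights)
    u, s, vt = svds(A, k=1)                  # top singular triplet
    # take absolute values so the scores are nonnegative regardless of sign convention
    return np.abs(u[:, 0]), s[0], np.abs(vt[0])
```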

Clustering Sentences into Topical Groups
The saliency scores are more effective if they are computed within each topical group of a document.
For sentence clustering we build an undirected weighted graph with vertices representing sentences; two sentences s_i and s_j are linked if they share terms, and the weight w_ij indicates the similarity between s_i and s_j. There are many different ways to specify these weights.
Sentences are arranged in linear order, and nearby sentences tend to be about the same topic. The observation that topical groups are usually made of sections of consecutive sentences is a strong prior, which we call the sentence link prior.

Incorporating Sentence Link Priors
We say s_i and s_j are nearby if s_i is immediately followed by s_j.
A simple approach to taking advantage of the sentence link prior is to modify the weights so that nearby sentence pairs receive extra link strength. We call α the sentence link prior, and use the idea of generalized cross-validation (GCV) to choose α.
Note that incorporating the sentence link prior is different from text segmentation: we allow several sections of consecutive sentences to form a single topical group.
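As an illustration, the sketch below assumes a simple additive modification, adding α to the similarity of each pair of consecutive sentences; the exact form of the modification is given in the paper and may differ:

```python
import numpy as np

def link_prior_weights(W_S, alpha):
    """Hypothetical additive form of W_S(alpha): boost the weight of each pair
    of consecutive sentences (i, i+1) by alpha. W_S is the symmetric m-by-m
    sentence similarity matrix."""
    W_alpha = W_S.astype(float).copy()
    m = W_S.shape[0]
    for i in range(m - 1):
        W_alpha[i, i + 1] += alpha
        W_alpha[i + 1, i] += alpha   # keep the graph undirected/symmetric
    return W_alpha
```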

Incorporating Sentence Link Priors
For a fixed α, apply the spectral clustering technique to obtain a partition Π*(α).
Define γ(Π) to be the number of consecutive sentence segments the partition generates; we then compute a GCV-style score as a function of α.
We select the α that maximizes this function as the estimated optimal α value.
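The segment count γ(Π) is easy to compute from the cluster labels taken in sentence order; the sketch below shows only this piece, since the full GCV-style scoring formula is not reproduced in the transcript:

```python
def num_consecutive_segments(labels):
    """gamma(Pi): number of maximal runs of consecutive sentences (in their
    linear order) that share the same cluster label."""
    if len(labels) == 0:
        return 0
    segments = 1
    for prev, cur in zip(labels, labels[1:]):
        if cur != prev:
            segments += 1
    return segments

# e.g. num_consecutive_segments([0, 0, 1, 1, 1, 0, 2]) == 4
```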

Sum-of-Squares Cost Function and Spectral Relaxation
In the bipartite graph G(T, S, W), each sentence is represented by a column of W = [w_1, …, w_n], which we call a sentence vector (in this section n denotes the number of sentences and m the number of terms).
A partition Π into k clusters can be written as W E = [W_1, …, W_k], where E is a permutation matrix and W_i is m-by-n_i.
For a given partition, the sum-of-squares cost function is
ss(Π) = Σ_{i=1}^{k} Σ_{w ∈ W_i} ||w − m_i||²,
where m_i is the centroid of the sentence vectors in cluster i and n_i is the number of sentences in cluster i.
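A direct implementation of this cost, with sentence vectors as columns and one label per sentence (names are illustrative):

```python
import numpy as np

def sum_of_squares_cost(W, labels, k):
    """ss(Pi): W holds one sentence vector per column; labels[j] is the cluster
    index (0..k-1) of sentence j."""
    labels = np.asarray(labels)
    cost = 0.0
    for i in range(k):
        cluster = W[:, labels == i]
        if cluster.shape[1] == 0:
            continue
        centroid = cluster.mean(axis=1, keepdims=True)   # m_i
        cost += np.sum((cluster - centroid) ** 2)
    return cost
```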

Sum-of-Squares Cost Function and Spectral Relaxation
The traditional K-means algorithm is iterative; in each iteration the following is performed:
For each sentence vector w, find the centroid m_i closest to it and associate w with that centroid.
Compute a new set of centroids.
A major drawback is that K-means is prone to local minima, sometimes producing clusters containing very few data points.
An equivalent formulation can be derived as a matrix trace maximization problem; it also makes the K-means method easily adaptable to utilizing the sentence link priors.
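For reference, a plain Lloyd-style K-means on the sentence vectors might look like the following sketch; it is the baseline whose local-minima problem motivates the spectral relaxation:

```python
import numpy as np

def kmeans_sentences(W, k, n_iter=100, seed=0):
    """Lloyd iteration on the columns of W (sentence vectors)."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    centroids = W[:, rng.choice(n, size=k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # assignment step: nearest centroid for each sentence vector
        d = ((W[:, :, None] - centroids[:, None, :]) ** 2).sum(axis=0)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # update step: recompute centroids (keep old centroid if a cluster empties)
        for i in range(k):
            if np.any(labels == i):
                centroids[:, i] = W[:, labels == i].mean(axis=1)
    return labels, centroids
```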

Sum-of-Squares Cost Function and Spectral Relaxation
Let e be a vector of appropriate dimension with all elements equal to one; thus m_i = W_i e / n_i.
The sum-of-squares cost function can then be written as
ss(Π) = trace(W^T W) − trace(X_Π^T W^T W X_Π),
where X_Π is the n-by-k orthonormal matrix whose i-th column equals e/√n_i on the rows of cluster i and zero elsewhere.
Minimizing ss(Π) is therefore equivalent to maximizing trace(X_Π^T W^T W X_Π).
Letting X be an arbitrary orthonormal matrix, we obtain the relaxed matrix trace maximization problem
max_{X^T X = I_k} trace(X^T W^T W X).

Sum-of-Squares Cost Function and Spectral Relaxation
An extension of the Rayleigh-Ritz characterization of eigenvalues of symmetric matrices shows that the above maximum is achieved by the eigenvectors corresponding to the k largest eigenvalues of the Gram matrix W^T W.
We also have the inequality
min_Π ss(Π) ≥ trace(W^T W) − Σ_{i=1}^{k} λ_i(W^T W),
which gives a lower bound for the minimum of the sum-of-squares cost function.
In particular, we can replace W^T W by W_S(α) after incorporating the link strength.
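A sketch of the relaxed solution and the resulting bound, assuming a symmetric matrix G (either the Gram matrix W^T W, for which the bound holds exactly, or W_S(α) once the link strength is incorporated):

```python
import numpy as np

def spectral_relaxation(G, k):
    """Top-k eigenvectors of the symmetric matrix G and the trace-based
    lower bound trace(G) - sum of the k largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(G)     # eigenvalues in ascending order
    V_k = eigvecs[:, -k:]                    # eigenvectors of the k largest eigenvalues
    lower_bound = np.trace(G) - eigvals[-k:].sum()
    return V_k, lower_bound
```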

Sum-of-Squares Cost Function and Spectral Relaxation
The cluster label assignment is done by QR decomposition with column pivoting:
Compute the k eigenvectors V_k = [v_1, …, v_k] of W_S(α) corresponding to the k largest eigenvalues.
Compute the pivoted QR decomposition of V_k^T as V_k^T P = Q [R_11, R_12], where Q is a k-by-k orthogonal matrix, R_11 is a k-by-k upper triangular matrix, and P is a permutation matrix.
Compute R̂ = R_11^{-1} [R_11, R_12] P^T.
The cluster label of each sentence is then determined by the row index of the largest element in absolute value in the corresponding column of R̂.
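A sketch of this assignment step using SciPy's pivoted QR; V_k is assumed to hold the k leading eigenvectors as columns, with one row per sentence:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

def assign_clusters(V_k):
    """Cluster labels from the k leading eigenvectors via QR with column pivoting."""
    k = V_k.shape[1]
    Q, R, piv = qr(V_k.T, pivoting=True)              # V_k^T P = Q [R11, R12]
    R11, R12 = R[:, :k], R[:, k:]
    # R_hat = R11^{-1} [R11, R12] P^T  =  [I_k, R11^{-1} R12] with columns un-permuted
    R_hat_piv = np.hstack([np.eye(k), solve_triangular(R11, R12)])
    R_hat = np.empty_like(R_hat_piv)
    R_hat[:, piv] = R_hat_piv                         # undo the column permutation
    return np.abs(R_hat).argmax(axis=0)               # one label per sentence
```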

Experimental Results
Evaluation is a challenging task: human-generated summaries tend to differ from one another; another approach is to evaluate summaries extrinsically by their performance on, for example, document retrieval or text categorization.
We collected 10 documents and manually divided each one into topical groups. Notice that the clustering is not unique: some clusters can be merged into a bigger cluster and some can be split into finer structures.

Experimental Results

In processing the documents, we delete stop words and apply Porter's stemming.
We construct W_S = (w_ij): each sentence is represented by a column of W, and w_ij is equal to the dot product of s_i and s_j. The sentence vectors are weighted with tf-idf weighting and normalized to have Euclidean length one.
To measure the quality of clustering, we take the manually assigned section number as the true cluster label. Here we use a greedy algorithm to compute a sub-optimal matching.
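A rough sketch of this preprocessing using scikit-learn's tf-idf vectorizer; stemming is omitted and the tokenization details are assumptions, so it only approximates the setup described above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def sentence_similarity_matrix(sentences):
    """Build W_S: tf-idf sentence vectors, normalized to unit Euclidean length,
    with w_ij the dot product of sentences i and j. Uses sklearn's English
    stop-word list; Porter stemming (used in the paper) is not applied here."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(sentences).toarray()        # one row per sentence
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1e-12)                   # Euclidean length one
    return X @ X.T                                     # w_ij = dot(s_i, s_j)
```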

Experimental Results
For a sequence of α values, we apply the spectral clustering algorithm to the weight matrix W_S(α) of the document dna. We also plot the clustering accuracy against α and contrast the clustering results with and without sentence link priors.
The clustering algorithm matches the section structure poorly when there is no nearby-sentence constraint (i.e., α = 0); with too large an α, sentence similarities are overwhelmed by the link strength and the results are also poor.

Experimental Results

The GCV method is quite effective at choosing a good α. In Table 1, the estimated α may differ from the optimal α but still produces accuracy that matches the best values well.

Experimental Results
For the computation of keyphrase and sentence saliency scores, we apply sentence weights when applying the mutual reinforcement principle; the i-th sentence receives a weight that depends on its length and position.
The idea is to mitigate the influence of long sentences by scaling down by a factor related to the sentence length, while at the same time giving a small boost to sentences close to the beginning of the document.
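The actual weight formula does not survive in the transcript; purely as a hypothetical illustration of the two stated properties (a length discount and a small positional boost), one could use something like:

```python
import numpy as np

def sentence_weight(i, length_i, boost=0.25, decay=10.0):
    """Hypothetical illustration only: discount long sentences and give a small
    boost to sentences near the start of the document. `boost` and `decay` are
    made-up knobs, not parameters from the paper."""
    return (1.0 / np.sqrt(length_i)) * (1.0 + boost * np.exp(-i / decay))
```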

Experimental Results
We use the document dna for illustration. For α = 3.5, the clustering matches the section structure well except for cluster 8: sentences 1 to 4 discuss issues related to the common ancestor "Eve", and section 4, with the heading Defining mitochondrial ancestors, is about the same topic. Here sentence similarities win over sentence link strength.
We also applied the mutual reinforcement principle to all sentences and extracted the first few keyphrases and sentences.

Conclusions
We presented a novel method for simultaneous keyphrase extraction and generic text summarization.
We explored the sentence link priors embedded in the linear order of a document to enhance clustering quality, and developed the mutual reinforcement principle to compute keyphrase and sentence saliency scores within each topical group.
Many issues need further investigation:
More research is needed on choosing the optimal α.
Other ways of clustering, for example a two-stage method: 1) segment the sentences, then 2) cluster the segments into topical groups.
Replacing simple terms by noun phrases, which will impact W and W_S.
Extension to translingual summarization.