LexPageRank: Prestige in Multi-Document Text Summarization
Gunes Erkan and Dragomir R. Radev
Department of EECS, School of Information, University of Michigan
ACL 2004
2/22 Abstract
This paper considers an approach for computing sentence importance based on the concept of eigenvector centrality (prestige), called LexPageRank
In this model, a sentence connectivity matrix is constructed based on cosine similarity
Experimental results on DUC 2004 show that this approach outperforms centroid-based summarization and compares favorably with other summarization systems
3/22 Introduction
Text summarization is the process of automatically creating a compressed version of a given text that provides useful information for the user
Our summarization approach is to assess the centrality of each sentence in a cluster and include the most important ones in the summary
–We introduce two new measures of centrality, Degree and LexPageRank, inspired by the prestige concept in social networks
4/22 Sentence centrality and centroid-based summarization
Extractive summarization produces summaries by choosing a subset of the sentences in the original documents
Centrality of a sentence is often defined in terms of the centrality of the words that it contains
The centroid of a cluster is a pseudo-document consisting of the words whose frequency*IDF scores are above a predefined threshold
In centroid-based summarization (Radev et al., 2000), the sentences that contain more words from the centroid of the cluster are considered central
–Centroid-based summarization has given promising results in the past
5/22 Prestige-based sentence centrality
We hypothesize that the sentences that are similar to many of the other sentences in a cluster are more central (or prestigious) to the topic
There are two issues
–How to define similarity between two sentences
  Cosine
–How to compute the overall prestige of a sentence given its similarity to other sentences
  Degree centrality
  Eigenvector centrality and LexPageRank
6/22 Prestige-based sentence centrality
A cluster may be represented by a cosine similarity matrix
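As a rough illustration of the matrix mentioned on this slide, the sketch below builds a pairwise cosine similarity matrix for a toy cluster. The whitespace tokenization, the raw term counts (no IDF weighting), the function names, and the example sentences are all assumptions made for illustration; they are not taken from the slides.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm_a * norm_b)

def similarity_matrix(sentences):
    """Symmetric matrix of pairwise cosine similarities."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    return [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]

# Hypothetical three-sentence cluster.
sentences = [
    "the storm hit the coast",
    "the storm damaged the coast",
    "markets rallied on friday",
]
sim = similarity_matrix(sentences)
for row in sim:
    print([round(x, 2) for x in row])
```

Each entry sim[i][j] is the similarity between sentences i and j; the diagonal is 1 because every sentence is identical to itself.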
7/22 Prestige-based sentence centrality
Most of the entries in the similarity matrix are nonzero
8/22 Prestige-based sentence centrality
Degree centrality
–Since we are interested in significant similarities in the matrix, we can eliminate low values by defining a threshold, so that the cluster can be viewed as an undirected graph
–We define degree centrality as the degree of each node in the similarity graph
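A minimal sketch of degree centrality as described on this slide: threshold the similarity matrix, then count each node's neighbors. The matrix values, the threshold of 0.2, and the choice to ignore a sentence's similarity to itself are illustrative assumptions, not values from the slides.

```python
def degree_centrality(sim, threshold):
    """Count, for each sentence, the other sentences whose similarity exceeds the threshold."""
    n = len(sim)
    return [
        # Assumption: self-similarity (the diagonal) is not counted as an edge.
        sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
        for i in range(n)
    ]

# Hypothetical symmetric cosine similarity matrix for four sentences.
sim = [
    [1.00, 0.45, 0.02, 0.30],
    [0.45, 1.00, 0.10, 0.25],
    [0.02, 0.10, 1.00, 0.05],
    [0.30, 0.25, 0.05, 1.00],
]
degrees = degree_centrality(sim, threshold=0.2)
print(degrees)  # [2, 2, 0, 2]
```

Raising the threshold makes the graph sparser and lowers the degrees, so the threshold value directly affects which sentences look central.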
9/22 Prestige-based sentence centrality
10/22 Prestige-based sentence centrality
11/22 Prestige-based sentence centrality
Issue for degree centrality
–Several unwanted sentences may vote for each other and raise their prestige
–This situation can be avoided by considering where the votes come from and taking the prestige of the voting node into account when weighting each vote
Eigenvector centrality and LexPageRank
–PageRank (Page et al., 1998) is a method proposed for assigning a prestige score to each page on the Web, independent of a specific query
  The score of a page depends on the number of pages that link to that page as well as the individual scores of the linking pages
12/22 Prestige-based sentence centrality
The PageRank of page A
$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$
$T_1, \ldots, T_n$: pages that link to page A
$d$: damping factor, $C(T_i)$: the number of outgoing links from page $T_i$
This recursively defined value can be computed by forming the binary adjacency matrix of the Web, normalizing this matrix so that each row sums to 1, and finding the principal eigenvector of the normalized matrix
The PageRank of the i-th page equals the i-th entry in the eigenvector
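The recursive definition can be approximated by repeatedly applying the standard PageRank update, PR(A) = (1 - d) + d * sum of PR(T_i) / C(T_i), until the scores settle. The sketch below does this on a tiny hypothetical link graph; the graph, the value d = 0.85, the fixed iteration count, and the function name are assumptions for illustration only.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A."""
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}  # arbitrary starting scores
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # T_1..T_n: pages that link to this page.
            incoming = [t for t in pages if page in out_links[t]]
            # C(T) is the number of outgoing links from T.
            new_pr[page] = (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in incoming)
        pr = new_pr
    return pr

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(web)
for page, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph, page C receives links from both A and B and so ends up with the highest score, while B, which is linked only from A, ends up lowest.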
13/22 Prestige-based sentence centrality
This method can be easily applied to the cosine similarity graph to find the most prestigious sentences in a document
We call this new measure of sentence centrality LexPageRank
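One way to picture this step is to threshold the cosine similarity matrix into an undirected graph and run the PageRank update on it, with each sentence's degree playing the role of its outgoing link count. The sketch below follows that reading; the similarity values, the threshold of 0.2, d = 0.85, the exclusion of self-links, and the function name are assumptions rather than settings reported in the slides.

```python
def lexpagerank(sim, threshold=0.2, d=0.85, iterations=100):
    """PageRank-style prestige scores on a thresholded cosine similarity graph."""
    n = len(sim)
    # Undirected neighbor lists: keep only similarities above the threshold.
    # Assumption: a sentence does not link to itself.
    neighbors = [
        [j for j in range(n) if j != i and sim[i][j] > threshold]
        for i in range(n)
    ]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each neighbor j passes on its score divided by its own degree.
        scores = [
            (1 - d) + d * sum(scores[j] / len(neighbors[j]) for j in neighbors[i])
            for i in range(n)
        ]
    return scores

# Hypothetical similarity matrix for five sentences.
# Edges above 0.2: 0-1, 0-2, 0-3 and 3-4.
sim = [
    [1.0, 0.5, 0.4, 0.3, 0.1],
    [0.5, 1.0, 0.1, 0.1, 0.0],
    [0.4, 0.1, 1.0, 0.1, 0.0],
    [0.3, 0.1, 0.1, 1.0, 0.6],
    [0.1, 0.0, 0.0, 0.6, 1.0],
]
scores = lexpagerank(sim)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranking, [round(s, 3) for s in scores])
```

Sentences 1, 2 and 4 all have degree 1 in this graph, yet their scores are not all equal, because each score also depends on the prestige and degree of the neighbor that votes for it.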
14/22 Prestige-based sentence centrality damping factor = 1
15/22 Prestige-based sentence centrality
Advantages over Centroid
–It accounts for information subsumption among sentences
–It prevents unnaturally high IDF scores from boosting the score of a sentence that is unrelated to the topic
16/22 Experiments on DUC 2004 data
DUC 2004 data was used in our experiments
Task 2 involves summarization of 50 TDT English clusters
Task 4 is to produce summaries of machine translation output (in English) of 24 Arabic TDT documents
The recall-based measure ROUGE is adopted, and 665-byte summaries are produced for each cluster
17/22 Experiments on DUC 2004 data
MEAD summarization toolkit
–Extractive multi-document summarization
–Consists of three components
  Feature extractor (document -> feature vector)
  –Centroid, Position and Length
  Combiner (feature vector -> scalar value)
  Reranker (the scores are adjusted upward or downward)
  –MMR (Maximal Marginal Relevance), CSIS (Cross-Sentence Information Subsumption)
weight Threshold
18/22 Experiments on DUC 2004 data Centroid
19/22 Experiments on DUC 2004 data
20/22 Experiments on DUC 2004 data
21/22 Experiments on DUC 2004 data
22/22 Conclusions
A novel approach to defining sentence centrality based on graph-based prestige scoring of sentences
We have introduced two different methods, Degree and LexPageRank, for computing prestige in a similarity graph
The experimental results are quite promising
Even the simplest approach, degree centrality, is a good enough heuristic to perform better than lead-based and centroid-based summaries