HCC class lecture 14 comments John Canny 3/9/05
Administrivia
Clustering: LSA again
The input is a matrix M. Rows represent text blocks (sentences, paragraphs or documents). Columns are distinct terms. Matrix elements are term counts (x tfidf weight). The idea is to “factor” this matrix into A D B:
$M = A D B$, where M is (text blocks x terms), A is (text blocks x themes), D is (themes x themes) and B is (themes x terms).
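A minimal sketch of how such an input matrix could be built. The function name `term_matrix`, the whitespace tokenization and the idf variant log(N/df) are all illustrative assumptions; the lecture does not fix any of them.

```python
import math
from collections import Counter

def term_matrix(blocks):
    """Return (terms, M): M[i][j] is the count of term j in block i times its idf weight.

    Assumptions: lowercase whitespace tokenization and idf = log(N / df).
    """
    docs = [Counter(b.lower().split()) for b in blocks]
    terms = sorted(set().union(*docs))
    n = len(docs)
    # Document frequency: number of blocks that contain each term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    idf = {t: math.log(n / df[t]) for t in terms}
    M = [[d[t] * idf[t] for t in terms] for d in docs]
    return terms, M

terms, M = term_matrix(["apple banana apple", "banana cherry"])
```

With this idf variant, a term that occurs in every block (here "banana") gets weight zero, so it does not influence the factorization.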
LSA again
A encodes the representation of each text block in a space of themes. B encodes each theme with term weights, so it can be used to explicitly describe the theme.
[Figure: the factorization M = A D B, with M (text blocks x terms), A (text blocks x themes), D (themes x themes), B (themes x terms)]
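The factorization can be sketched with a truncated SVD in NumPy. The toy matrix and the helper name `lsa` are made up for illustration; this is a sketch under those assumptions, not a full LSA pipeline.

```python
import numpy as np

# Toy matrix M: rows are text blocks, columns are terms (values are made up).
M = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)

def lsa(M, k):
    """Rank-k factors: A (blocks x themes), D (themes x themes), B (themes x terms)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

A, D, B = lsa(M, k=2)
M_approx = A @ D @ B  # best rank-2 approximation of M in the least-squares sense
```

Each row of A places one text block in theme space, and each row of B lists the term weights of one theme.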
LSA limitations
LSA has a few assumptions that don’t make much sense:
– If documents really do comprise different “themes”, there shouldn’t be negative weights in the LSA matrices.
– LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.
– SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
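The first and third points can be checked numerically on a small example. The matrix below is made up; the sketch only shows that an SVD of a non-negative matrix yields factors with negative entries and orthonormal theme vectors.

```python
import numpy as np

# Made-up non-negative term matrix (text blocks x terms).
M = np.array([
    [2, 1, 0, 0],
    [1, 2, 1, 0],
    [0, 1, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
A, B = U[:, :2], Vt[:2, :]  # keep the two strongest themes

# Negative weights appear even though every entry of M is non-negative.
has_negative = bool((A < 0).any() or (B < 0).any())

# Theme vectors (rows of B) are orthonormal: B B^T is the identity.
gram = B @ B.T
```

The second theme has to be orthogonal to the first, which is what forces mixed signs here.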
Non-negative Matrix Factorization
NMF deals with non-negativity and orthogonality, but still uses Gaussian statistics. Recall the LSA limitations:
– If documents really do comprise different “themes”, there shouldn’t be negative weights in the LSA matrices.
– LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.
– SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
LSA again
The consequences are:
– LSA themes are not meaningful beyond the first few (the ones with the largest singular values).
– LSA is largely insensitive to the choice of semantic space (most 300-dim spaces will do).
NMF
The corresponding properties:
– NMF components track themes well (up to 30 or more).
– The NMF components can be used directly as topic markers, so the choice is important.
NMF
NMF is an umbrella term for several algorithms. The one in this paper uses least squares to match the original term matrix, i.e. it minimizes:
$\|M - AB\|^2 = \sum_{i,j} (M_{ij} - (AB)_{ij})^2$
Another natural metric is the KL or Kullback-Leibler divergence. The KL-divergence between two probability distributions p and q is:
$\sum_i p_i \log (p_i / q_i)$
Another natural version of NMF uses the KL-divergence between M and its approximation A B.
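A minimal sketch of both quantities. The least-squares fit below uses the well-known Lee and Seung multiplicative updates, which is an assumption: the paper discussed in the lecture may use a different least-squares algorithm. The names `kl_divergence` and `nmf_least_squares`, the iteration count and the small constant `eps` are illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i log(p_i / q_i); terms with p_i = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def nmf_least_squares(M, k, iters=200, eps=1e-9, seed=0):
    """Approximate M by A @ B with A, B >= 0, reducing ||M - AB||^2."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    A = rng.random((n, k)) + eps
    B = rng.random((k, m)) + eps
    for _ in range(iters):
        # Multiplicative updates keep every entry non-negative.
        B *= (A.T @ M) / (A.T @ A @ B + eps)
        A *= (M @ B.T) / (A @ B @ B.T + eps)
    return A, B
```

Because the updates only multiply by non-negative ratios, A and B never leave the non-negative orthant, which is what makes the components readable as themes.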
NMF
KL-divergence is usually a more accurate way to compare probability distributions. However, in clustering applications, the quality of fit to the probability distribution is secondary to the quality of the clusters. KL-divergence NMF performs well for smoothing (extrapolation) tasks, but not as well as least-squares NMF for clustering. The reasons are not entirely clear, but it may simply be an artifact of the basic NMF recurrences, which find only locally optimal matches.
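For comparison, a sketch of the KL-divergence variant, again assuming the Lee and Seung multiplicative recurrences (the lecture does not name the exact recurrences). The objective is the generalized KL divergence for unnormalized matrices; `nmf_kl`, `generalized_kl`, `eps` and the iteration count are illustrative choices.

```python
import numpy as np

def generalized_kl(M, X, eps=1e-9):
    """sum_ij M_ij log(M_ij / X_ij) - M_ij + X_ij, with 0 log 0 treated as 0."""
    M = np.asarray(M, dtype=float)
    X = np.asarray(X, dtype=float)
    mask = M > 0
    return float(np.sum(M[mask] * np.log(M[mask] / (X[mask] + eps))) - M.sum() + X.sum())

def nmf_kl(M, k, iters=200, eps=1e-9, seed=0):
    """Approximate M by A @ B with A, B >= 0, reducing the generalized KL divergence."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    A = rng.random((n, k)) + eps
    B = rng.random((k, m)) + eps
    for _ in range(iters):
        R = M / (A @ B + eps)                      # elementwise ratio M / (AB)
        B *= (A.T @ R) / (A.sum(axis=0)[:, None] + eps)
        R = M / (A @ B + eps)
        A *= (R @ B.T) / (B.sum(axis=1)[None, :] + eps)
    return A, B
```

The result depends on the random starting point set by `seed`, which is one way to see the local-optimum behavior mentioned above: different seeds can end at different fits.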
A Simpler Text Summarizer
A simpler text summarizer based on inter-sentence analysis did as well as any of the custom systems on the DUC-2002 dataset (Document Understanding Conference). This algorithm, called “TextRank”, is based on an analysis of the similarity graph between sentences in the text.
A Simpler Text Summarizer
Vertices in the graph represent sentences; edge weights are the similarity between sentences.
[Figure: similarity graph over sentence vertices S1 to S7]
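A sketch of how such a graph could be built. The similarity used here, shared words divided by the sum of the log sentence lengths, is an assumption chosen for illustration; any sentence similarity (for example cosine similarity of term vectors) could be substituted. The function names are hypothetical.

```python
import math

def similarity(s1, s2):
    """Word overlap normalized by log sentence lengths (an assumed measure)."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0
    denom = math.log(len(w1)) + math.log(len(w2))
    if denom <= 0:  # both sentences have a single word
        return 0.0
    return len(set(w1) & set(w2)) / denom

def similarity_graph(sentences):
    """Symmetric weight matrix W with W[i][j] = similarity of sentences i and j."""
    n = len(sentences)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = similarity(sentences[i], sentences[j])
    return W

W = similarity_graph([
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell sharply today",
])
```

The diagonal is left at zero so a sentence does not vote for itself.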
TextRank
TextRank computes vertex strength using a variant of Google’s PageRank. It gives the probability of being at a vertex during a long random walk on the graph.
[Figure: similarity graph over sentence vertices S1 to S7]
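The random-walk view can be sketched as a power iteration on the weighted graph. The damping factor 0.85, the fixed iteration count and the name `textrank_scores` are assumptions for illustration; the lecture does not give these details.

```python
import numpy as np

def textrank_scores(W, d=0.85, iters=100):
    """Visit probabilities of a damped random walk on a weighted graph W."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    out = W.sum(axis=1, keepdims=True)
    safe = np.where(out > 0, out, 1.0)
    # Row-normalize edge weights; a vertex with no edges jumps uniformly.
    P = np.where(out > 0, W / safe, 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        # With probability d follow an edge, otherwise jump to a random vertex.
        r = (1 - d) / n + d * (P.T @ r)
    return r

# Star graph: vertex 0 is similar to every other sentence.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
scores = textrank_scores(star)
```

In the star graph the central vertex collects the highest score, matching the intuition that a sentence similar to many others is a good summary candidate.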
TextRank
The highest-ranked vertices comprise the summary. TextRank achieved the same summary performance as the best single-sentence summarizers at DUC. (TextRank appeared in ACL 2004.)
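Selecting the summary from the scores is a short step. The sketch below assumes the chosen sentences are presented in their original document order, which is a design choice not stated in the lecture; `top_sentences` is a hypothetical helper name.

```python
def top_sentences(sentences, scores, k):
    """Pick the k highest-scoring sentences and return them in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]

summary = top_sentences(["a", "b", "c", "d"], [0.1, 0.4, 0.2, 0.3], k=2)
```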
Discussion Topics
T1: The best text analysis algorithms for a variety of tasks seem to use numerical representations (bag-of-words (BOW) or graphical models) of texts. Discuss what information these representations capture and why they might be effective.