Slide 1: HCC class lecture 14 comments. John Canny, 3/9/05
Slide 2: Administrivia
Slide 3: Clustering: LSA again
The input is a matrix M. Rows represent text blocks (sentences, paragraphs, or documents), columns are distinct terms, and matrix elements are term counts (optionally scaled by tf-idf weights). The idea is to factor this matrix as M = A D B.
[Figure: M (text blocks × terms) = A (text blocks × themes) · D (diagonal theme strengths) · B (themes × terms)]
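As a concrete illustration, here is a minimal sketch of building such a term matrix in plain Python. The whitespace tokenizer and the particular idf formula are illustrative assumptions, not details from the lecture.

```python
# Minimal sketch: build a (text blocks x terms) matrix of term counts
# scaled by idf weights. The naive lowercase/whitespace tokenizer is
# deliberately simple; a real system would use a proper tokenizer.
import math
from collections import Counter

def term_matrix(blocks):
    """blocks: list of strings (sentences, paragraphs, or documents)."""
    tokenized = [b.lower().split() for b in blocks]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: j for j, t in enumerate(vocab)}
    n = len(blocks)
    # document frequency: number of blocks each term appears in
    df = Counter(t for toks in tokenized for t in set(toks))
    M = [[0.0] * len(vocab) for _ in range(n)]
    for i, toks in enumerate(tokenized):
        for t, count in Counter(toks).items():
            # raw count times idf weight (zero for ubiquitous terms)
            M[i][index[t]] = count * math.log(n / df[t])
    return M, vocab
```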
Slide 4: LSA again
A encodes the representation of each text block in a space of themes. B encodes each theme with term weights, so it can be used to explicitly describe the theme.
[Figure: the same factorization diagram, M = A D B]
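In practice the LSA factorization is computed with a truncated SVD; a minimal numpy sketch follows, with the number of themes k as a free parameter (the "semantic space" dimension, often around 300).

```python
# Sketch of the LSA factorization M = A D B via truncated SVD.
import numpy as np

def lsa(M, k):
    M = np.asarray(M, dtype=float)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :k]          # text blocks x themes
    D = np.diag(s[:k])    # diagonal theme strengths (singular values)
    B = Vt[:k, :]         # themes x terms
    return A, D, B
```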
Slide 5: LSA limitations
LSA has a few assumptions that don't make much sense:
– If documents really do comprise different "themes", there shouldn't be negative weights in the LSA matrices.
– LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.
– SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
Slide 6: Non-negative Matrix Factorization
NMF deals with non-negativity and orthogonality, but still uses Gaussian statistics:
– If documents really do comprise different "themes", there shouldn't be negative weights in the LSA matrices.
– LSA implicitly models Gaussian random processes for theme and word generation. Actual document statistics are far from Gaussian.
– SVD forces themes to be orthogonal in the A and B matrices. Why should they be?
Slide 7: LSA again
The consequences are:
– LSA themes are not meaningful beyond the first few (the ones with the strongest singular values).
– LSA is largely insensitive to the choice of semantic space (most 300-dimensional spaces will do).
Slide 8: NMF
The corresponding properties:
– NMF components track themes well (up to 30 or more).
– The NMF components can be used directly as topic markers, so the choice is important.
Slide 9: NMF
NMF is an umbrella term for several algorithms. The one in this paper uses least squares to match the original term matrix, i.e. it minimizes the squared error ‖M − AB‖². Another natural metric is the KL, or Kullback-Leibler, divergence. The KL-divergence between two probability distributions p and q is KL(p‖q) = Σᵢ pᵢ log(pᵢ/qᵢ). Another natural version of NMF uses the KL-divergence between M and its approximation AB.
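A sketch of the least-squares variant, using the standard multiplicative updates of Lee and Seung; the paper's exact algorithm may differ, but this minimizes the same ‖M − AB‖² objective while keeping all entries nonnegative.

```python
# Sketch of least-squares NMF via multiplicative updates (Lee & Seung).
import numpy as np

def nmf_ls(M, k, iters=200, eps=1e-9):
    M = np.asarray(M, dtype=float)
    rng = np.random.default_rng(0)
    A = rng.random((M.shape[0], k))   # text blocks x themes, nonnegative
    B = rng.random((k, M.shape[1]))   # themes x terms, nonnegative
    for _ in range(iters):
        # multiplicative updates preserve nonnegativity throughout
        B *= (A.T @ M) / (A.T @ A @ B + eps)
        A *= (M @ B.T) / (A @ B @ B.T + eps)
    return A, B
```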
Slide 10: NMF
KL-divergence is usually a more accurate way to compare probability distributions. However, in clustering applications, the quality of fit to the probability distribution is secondary to the quality of the clusters. KL-divergence NMF performs well for smoothing (extrapolation) tasks, but not as well as least squares for clustering. The reasons are not entirely clear, but it may simply be an artifact of the basic NMF recurrences, which find only locally-optimal matches.
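For comparison, a sketch of the KL-divergence variant, again using the standard Lee and Seung multiplicative updates; whether this matches the exact formulation discussed in class is an assumption.

```python
# Sketch of KL-divergence NMF via multiplicative updates (Lee & Seung).
# Minimizes sum(M * log(M / AB) - M + AB) rather than the squared error.
import numpy as np

def nmf_kl(M, k, iters=200, eps=1e-9):
    M = np.asarray(M, dtype=float)
    rng = np.random.default_rng(0)
    A = rng.random((M.shape[0], k))
    B = rng.random((k, M.shape[1]))
    for _ in range(iters):
        R = M / (A @ B + eps)                           # ratio M / (AB)
        B *= (A.T @ R) / (A.sum(axis=0)[:, None] + eps)
        R = M / (A @ B + eps)
        A *= (R @ B.T) / (B.sum(axis=1)[None, :] + eps)
    return A, B
```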
Slide 11: A Simpler Text Summarizer
A simpler text summarizer based on inter-sentence analysis did as well as any of the custom systems on the DUC-2002 dataset (Document Understanding Conference). This algorithm, called "TextRank", was based on a graphical analysis of the similarity graph between sentences in the text.
Slide 12: A Simpler Text Summarizer
Vertices in the graph represent sentences; edge weights are similarities between sentences.
[Figure: weighted similarity graph over sentences S1 through S7]
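A sketch of how such edge weights can be computed; the overlap-over-log-lengths similarity below is the measure from the original TextRank paper, and the whitespace tokenization is a simplification.

```python
# Sketch: pairwise sentence similarity graph as a weight matrix.
# Similarity = word overlap normalized by the log sentence lengths.
import math

def similarity(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:   # avoid log(1) = 0 in the denominator
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

def build_graph(sentences):
    n = len(sentences)
    return [[similarity(sentences[i], sentences[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]
```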
Slide 13: TextRank
TextRank computes vertex strength using a variant of Google's PageRank. It gives the probability of being at a vertex during a long random walk on the graph.
[Figure: the same sentence similarity graph, S1 through S7]
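A minimal sketch of that random-walk computation as a weighted PageRank power iteration; the damping factor d = 0.85 is the conventional PageRank choice, not a value given in the lecture.

```python
# Sketch: weighted PageRank over the similarity matrix W. The scores
# converge to the stationary distribution of a random walk that follows
# edge weights with probability d and jumps uniformly otherwise.
import numpy as np

def textrank_scores(W, d=0.85, iters=100):
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    # row-normalize so each row gives the walk's transition probabilities;
    # rows with no edges fall back to a uniform jump
    rowsum = W.sum(axis=1, keepdims=True)
    P = np.divide(W, rowsum, out=np.full_like(W, 1.0 / n),
                  where=rowsum > 0)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (r @ P)
    return r   # r[i] ~ probability of being at sentence i
```

Taking the top-scoring vertices (e.g. `np.argsort(-r)[:3]`) then yields the summary, as the next slide notes.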
Slide 14: TextRank
The highest-ranked vertices comprise the summary. TextRank achieved the same summary performance as the best single-sentence summarizers at DUC-2002. (TextRank appeared in ACL 2004.)
Slide 15: Discussion Topics
T1: The best text analysis algorithms for a variety of tasks seem to use numerical representations (bag-of-words or graphical models) of texts. Discuss what information these representations capture and why they might be effective.