Latent Semantic Analysis (LSA)
Jed Crandall
16 June 2009
What is LSA?
"A technique in natural language processing of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms" – paraphrasing Wikipedia
Based on the "bag of words" model
Timeline
LSA patented in 1988
◦ Deerwester et al., mostly psychology types from the University of Colorado
pLSA by Hofmann in 1999
◦ Assumes a Poisson distribution of term counts in documents instead of Gaussian noise
Latent Dirichlet Allocation, Blei et al. in 2003
◦ More of a graphical model
What you can do with LSA
Start with a corpus of text, e.g., Wikipedia
Create a term frequency matrix
Do some fancy math (not too fancy, though)
Output is a matrix you can use to project terms (or documents) into a lower-dimensional concept space
◦ "Mao Zedong" and "communism" should have a high dot product
◦ "Mao Zedong" and "pocket watch" should not
Intuition behind LSA
"A is 5 furlongs away from B"
"A is 5 furlongs away from C"
"B is 8 furlongs away from C"
In two dimensions
[Figure: A, B, and C laid out in the plane; A is 5 furlongs from both B and C, and B is 8 furlongs from C]
Noise in the measurements
"A, B, and C are all on a straight, flat road"
Dimension reduction
[Figure: the same three points forced onto a straight line; the flattened distances come out to roughly 4.5 furlongs from A to each of B and C, and 9 furlongs from B to C]
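To make the flattening concrete, here is a minimal numpy sketch (my own illustration, not from the deck): it places A, B, and C at coordinates consistent with the stated distances, then uses a rank-1 truncated SVD to project them onto their best-fit line, the same operation LSA performs in far more dimensions. The exact flattened distances depend on how the fit is done, so they need not match the figure's numbers.

```python
import numpy as np

# Coordinates consistent with |AB| = 5, |AC| = 5, |BC| = 8 furlongs:
# B and C on the x-axis, A above their midpoint (two 3-4-5 triangles).
A = np.array([4.0, 3.0])
B = np.array([0.0, 0.0])
C = np.array([8.0, 0.0])

X = np.vstack([A, B, C])
Xc = X - X.mean(axis=0)                 # center before fitting the line

# Rank-1 truncation of the SVD = orthogonal projection onto the
# dominant direction, i.e., the least-squares best-fit line.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
s[1:] = 0.0                             # keep only the largest singular value
Xf = (U * s) @ Vt + X.mean(axis=0)      # flattened points

Af, Bf, Cf = Xf
print(np.linalg.norm(Af - Bf))          # A-B shrinks from 5 to 4
print(np.linalg.norm(Bf - Cf))          # B-C stays 8 (it already lay along the line)
```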
Assumptions
Suppose we want to do LSA on Wikipedia, with k = 600
We're assuming all authors draw from two sources when choosing words while writing an article
◦ A 600-dimensional "true" concept space
◦ Their freedom of choice, which we model as white Gaussian noise
Process
Build a term frequency matrix
Do tf-idf weighting
Calculate a singular value decomposition (SVD)
Do a rank reduction by chopping off all but the k = 600 largest singular values
Can now map terms (or documents) into the space defined by the rank-600 approximation of our original matrix, and compare them with a dot product or cosine similarity (a high-level sketch of the whole pipeline follows; the next slides walk through each step)
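As a 10,000-foot view before the step-by-step slides, here is a minimal end-to-end sketch using scikit-learn; the three-document corpus and tiny k are stand-ins (k = 600 is for a Wikipedia-scale corpus).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy corpus standing in for Wikipedia.
docs = [
    "mao zedong led the communist party of china",
    "communism is a political and economic ideology",
    "a pocket watch is a watch made to be carried in a pocket",
]

tfidf = TfidfVectorizer()               # term frequency matrix + tf-idf weighting
X = tfidf.fit_transform(docs)           # documents x terms, sparse

svd = TruncatedSVD(n_components=2)      # SVD + rank reduction (k = 2 here, 600 for Wikipedia)
doc_vecs = svd.fit_transform(X)         # documents mapped into concept space
term_vecs = svd.components_.T           # terms x k: terms in the same space
```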
Build a term frequency matrix
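A minimal sketch of this step in plain numpy (toy corpus assumed): rows are terms, columns are documents, and entry (i, j) counts occurrences of term i in document j.

```python
from collections import Counter
import numpy as np

docs = [
    "mao zedong communism",
    "communism ideology",
    "pocket watch pocket",
]

vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# A[i, j] = number of occurrences of term i in document j.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w, n in Counter(d.split()).items():
        A[row[w], j] = n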
Do tf-idf weighting (optional)
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}, where n_{i,j} is the # of occurrences of term i in document j and the denominator is the total # of terms in document j
idf_i = log(N / |{j : n_{i,j} > 0}|), where the numerator N is the total # of documents and the denominator is the # of documents where term i appears at least once (like entropy)
Weight: w_{i,j} = tf_{i,j} × idf_i
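Coding the two fractions above directly (this is one common tf-idf variant; weighting schemes differ across sources):

```python
import numpy as np

def tfidf(A):
    """A: terms x documents count matrix, as built on the previous slide."""
    tf = A / A.sum(axis=0, keepdims=True)   # n_ij / total # of terms in document j
    df = (A > 0).sum(axis=1)                # # of documents where term i appears
    idf = np.log(A.shape[1] / df)           # log(N / df_i)
    return tf * idf[:, np.newaxis]

A = np.array([[2., 0., 1.],                 # toy 3-term x 3-document counts
              [0., 3., 0.],
              [1., 1., 1.]])
W = tfidf(A)                                # note: term 2 appears in every document,
                                            # so its idf (and thus its weight) is zero
```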
Calculate an SVD
Why "an" SVD, not "the" SVD? The decomposition X = U Σ V^T is not unique: flipping the sign of a column of U and the matching row of V^T gives another valid SVD
U and V are unitary matrices
◦ Normal
◦ |Determinant| = 1 (length preserving)
Σ is diagonal; the number of nonzero singular values is the rank of the matrix
An SVD exists for every matrix
Fast, numerically stable, can stop at k, etc.
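A numpy sketch of this step; the toy matrix is a stand-in for the tf-idf weighted term-document matrix.

```python
import numpy as np

W = np.array([[2., 0., 1.],     # stand-in for the weighted term-document matrix
              [0., 3., 0.],
              [1., 1., 1.]])

U, s, Vt = np.linalg.svd(W, full_matrices=False)

# W is recovered exactly (up to floating point); s is sorted descending.
assert np.allclose(W, U @ np.diag(s) @ Vt)

# "An" SVD: negate a column of U and the matching row of Vt and the
# product is unchanged, so the decomposition is not unique.
U2, Vt2 = U.copy(), Vt.copy()
U2[:, 0] *= -1
Vt2[0, :] *= -1
assert np.allclose(W, U2 @ np.diag(s) @ Vt2)
```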
Rank reduction
Rank reduction = dimension reduction
Keeping the k largest singular values gives the rank-k matrix that is the optimal approximation of our original matrix in terms of the Frobenius norm (the Eckart-Young theorem)
Rank reduction has the effect of reducing white Gaussian noise
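Continuing the same sketch, chopping off all but the k largest singular values (k = 2 here in place of 600):

```python
import numpy as np

W = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 1., 1.]])
U, s, Vt = np.linalg.svd(W, full_matrices=False)

k = 2                                        # 600 for a Wikipedia-scale corpus
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
W_k = U_k @ np.diag(s_k) @ Vt_k              # best rank-k approximation of W
                                             # in the Frobenius norm

print(np.linalg.norm(W - W_k))               # the error equals the discarded
print(s[k:])                                 # singular value(s)
```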
Map terms (or documents) into concept space
Note the notation: using V here instead of P (plagiarism from multiple sources)
Can compare terms using, e.g., cosine similarity
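A sketch of the comparison step. Scaling the singular vectors by the singular values, as below, is one common convention; sources differ here, which is also where the V-vs-P notation confusion comes from.

```python
import numpy as np

W = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 1., 1.]])
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2

term_vecs = U[:, :k] * s[:k]        # row i = term i in concept space
doc_vecs = Vt[:k, :].T * s[:k]      # row j = document j in concept space

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# High value = the two terms tend to appear in similar documents.
print(cosine(term_vecs[0], term_vecs[2]))
print(cosine(term_vecs[0], term_vecs[1]))
```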
Example application: ConceptDoppler
Example filtered terms: 我的奋斗 (Mein Kampf), 转化率 (conversion rate), 绝食 (hunger strike)
The list changes dramatically at times
◦ 19 September 2007 – 122 out of ?
◦ 6 March 2008 – 108 out of ?
◦ 18 June 2008 – 133 words out of ?
◦ As of February 2009, 法轮功 (Falun Gong) not blocked
Questions?
Sources I plagiarized from:
◦ Wikipedia article on latent semantic analysis
◦ ConceptDoppler: Crandall et al., CCS 2007