14.0 Linguistic Processing and Latent Topic Analysis

Latent Semantic Analysis (LSA)
(figure: words and documents linked through a layer of latent topics Tk)

Latent Semantic Analysis (LSA) - Word-Document Matrix Representation
Vocabulary V of size M and corpus T of size N
V = {w1, w2, ..., wi, ..., wM}, wi: the i-th word, e.g. M = 2×10^4
T = {d1, d2, ..., dj, ..., dN}, dj: the j-th document, e.g. N = 10^5
cij: number of times wi occurs in dj
nj: total number of words present in dj
ti = Σj cij: total number of times wi occurs in T
Word-Document Matrix W = [wij], rows indexed by the words w1, ..., wM and columns by the documents d1, ..., dN
each row of W is an N-dim "feature vector" for a word wi with respect to all documents dj
each column of W is an M-dim "feature vector" for a document dj with respect to all words wi
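
The sketch below (not part of the original slides) shows one way to build the word-document count matrix defined above with numpy; the toy corpus, the whitespace tokenization, and the variable names are illustrative assumptions.

```python
# Minimal sketch: building the M x N word-document count matrix C = [c_ij].
import numpy as np

docs = [
    "latent semantic analysis maps words and documents to concepts",
    "singular value decomposition factorizes the word document matrix",
    "topic models describe documents as mixtures of latent topics",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})   # V = {w_1, ..., w_M}
word_index = {w: i for i, w in enumerate(vocab)}

M, N = len(vocab), len(docs)
C = np.zeros((M, N))                     # c_ij: count of word w_i in document d_j
for j, toks in enumerate(tokenized):
    for w in toks:
        C[word_index[w], j] += 1

n_j = C.sum(axis=0)                      # total words in document d_j
t_i = C.sum(axis=1)                      # total occurrences of word w_i in corpus T
```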

Latent Semantic Analysis (LSA)
(figure: the word-document matrix W)

Dimensionality Reduction (1/2)
dimensionality reduction: selection of the R largest eigenvalues si^2 of W W^T (R = 800 for example)
R "concepts" or "latent semantic concepts"

Dimensionality Reduction (2/2)
dimensionality reduction: selection of the R largest eigenvalues of W^T W (an N×N matrix)
si^2: weights (significance of the "component matrices" e′i e′i^T)
R "concepts" or "latent semantic concepts"
(i, j) element of W^T W: inner product of the i-th and j-th columns of W, the "similarity" between di and dj

Singular Value Decomposition (SVD)
W = U S V^T: U is the M×R left singular matrix, S = diag(s1, ..., sR) is R×R, and V^T is the R×N transpose of the right singular matrix V
si: singular values, s1 ≥ s2 ≥ ... ≥ sR
Vectors for word wi: ui S (the i-th row of U scaled by S)
a vector with dimensionality N reduced to a vector ui S with dimensionality R
the N-dimensional space defined by N documents reduced to an R-dimensional space defined by R "concepts"
the R row vectors of V^T, or column vectors of V, or eigenvectors {e′1, ..., e′R}, are the R orthonormal basis vectors of the "latent semantic space" with dimensionality R, in which ui S is represented
words with similar "semantic concepts" have "closer" locations in the "latent semantic space": they tend to appear in similar "types" of documents, although not necessarily in exactly the same documents

Singular Value Decomposition (SVD)
dp = U S vp^T (just as a column in W = U S V^T)

Singular Value Decomposition (SVD)
Vectors for document dj: vj S (the j-th row of V scaled by S; or S vj^T as a column)
a vector with dimensionality M reduced to a vector vj S with dimensionality R
the M-dimensional space defined by M words reduced to an R-dimensional space defined by R "concepts"
the R columns of U, or eigenvectors {e1, ..., eR}, are the R orthonormal basis vectors of the "latent semantic space" with dimensionality R, in which vj S is represented
documents with similar "semantic concepts" have "closer" locations in the "latent semantic space": they tend to include similar "types" of words, although not necessarily exactly the same words
the association structure between words wi and documents dj is preserved, with noisy information deleted, while the dimensionality is reduced to a common set of R "concepts"
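
As a concrete illustration of the decomposition above, the following sketch computes a rank-R truncated SVD with numpy and extracts the R-dim word vectors ui S and document vectors vj S; the function name and the choice of applying it directly to a raw count matrix (such as C built earlier) are assumptions, since the slides may weight W first.

```python
# Minimal sketch: rank-R truncated SVD of the word-document matrix W.
import numpy as np

def truncated_svd(W, R):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_R, S_R, Vt_R = U[:, :R], np.diag(s[:R]), Vt[:R, :]
    word_vecs = U_R @ S_R        # row i: u_i S, word w_i in the latent space
    doc_vecs = Vt_R.T @ S_R      # row j: v_j S, document d_j in the latent space
    return U_R, S_R, Vt_R, word_vecs, doc_vecs
```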

Example Applications in Linguistic Processing
Word Clustering
example applications: class-based language modeling, information retrieval, etc.
words with similar "semantic concepts" have "closer" locations in the "latent semantic space": they tend to appear in similar "types" of documents, although not necessarily in exactly the same documents
each component of the reduced word vector ui S is the "association" of the word with the corresponding "concept"
example similarity measure between two words: cos(ui S, uj S) = ui S^2 uj^T / (|ui S| |uj S|)
Document Clustering
example applications: clustered language modeling, language model adaptation, information retrieval, etc.
documents with similar "semantic concepts" have "closer" locations in the "latent semantic space": they tend to include similar "types" of words, although not necessarily exactly the same words
each component of the reduced document vector vj S is the "association" of the document with the corresponding "concept"
example similarity measure between two documents: cos(vi S, vj S) = vi S^2 vj^T / (|vi S| |vj S|)
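
A minimal sketch of the word-clustering similarity just described, assuming the cosine measure written above (the exact formula on the original slide is not fully recoverable from the transcript):

```python
# "Closeness" of two words in the latent space: cosine between u_i S and u_j S.
import numpy as np

def word_similarity(word_vecs, i, j):
    a, b = word_vecs[i], word_vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```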

LSA for Linguistic Processing - Cosine Similarity
−1 ≤ cos θ = (A · B) / (|A| |B|) ≤ 1
cos θ = 0 if A ⊥ B
A · B = |A| |B| cos θ (magnitude × similarity)
(figure: vectors A and B, e.g. for histories dq−1, dq and words wq, wq+1, compared by the angle θ between them)

Example Applications in Linguistic Processing
Information Retrieval
"concept matching" vs "lexical matching": relevant documents are associated with similar "concepts", but may not include exactly the same words
example approach: treat the query as a new document (by "folding in"), and evaluate its "similarity" with all possible documents
Fold-in
consider a new document outside the training corpus T, but with similar language patterns or "concepts"
construct a new column dp, p > N, with respect to the M words, assuming U and S remain unchanged
dp = U S vp^T (just as a column in W = U S V^T)
vp S = dp^T U is an R-dim representation of the new document (i.e. the projection of dp on the basis vectors ei of U, obtained by inner products)
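
The sketch below illustrates fold-in and retrieval as described above: a new document's count vector dp is projected as vp S = dp^T U and then compared with the training documents by cosine similarity; the function names are illustrative.

```python
# Minimal sketch of "folding in" a new M-dim count vector d_p and ranking
# the training documents by similarity in the R-dim latent space.
import numpy as np

def fold_in(d_p, U_R):
    return d_p @ U_R            # v_p S, an R-dim representation of d_p

def retrieve(query_counts, U_R, doc_vecs):
    q = fold_in(query_counts, U_R)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)    # document indices ranked by "concept" similarity
```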

Integration with N-gram Language Models
Language Modeling for Speech Recognition: Prob(wq | dq−1)
wq: the q-th word in the current document to be recognized (q: sequence index)
dq−1: the recognized history in the current document
vq−1 = dq−1^T U: representation of dq−1 by vq−1 (folded in)
Prob(wq | dq−1) can be estimated from uq and vq−1 in the R-dim space
integration with N-gram: Prob(wq | Hq−1) = Prob(wq | hq−1, dq−1)
Hq−1: history up to wq−1; hq−1 = <wq−n+1, wq−n+2, ..., wq−1>
the N-gram gives local relationships, while dq−1 gives semantic concepts
dq−1 emphasizes the key content words more, while the N-gram counts all words similarly, including function words
vq−1 for dq−1 can be estimated iteratively, assuming the q-th word in the current document is wi (the i-th dimension out of M): the estimate of vq moves in the R-dim space initially and eventually settles down (a simple interpolation sketch follows below)
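
As a hypothetical illustration only: one simple way to combine the two information sources is linear interpolation between the N-gram probability and an LSA-derived probability. The mapping from cosine similarity to Prob(wq | dq−1) used here (renormalized positive similarities) and the weight lam are assumptions, not the exact scheme on the slide.

```python
# Hypothetical sketch: interpolating an n-gram probability with an LSA-based one.
import numpy as np

def lsa_word_prob(word_vecs, v_hist):
    # cosine similarity between each word vector u_i S and the folded-in history
    # vector, clipped at 0 and renormalized into a distribution over the vocabulary
    sims = word_vecs @ v_hist / (
        np.linalg.norm(word_vecs, axis=1) * np.linalg.norm(v_hist) + 1e-12)
    scores = np.clip(sims, 0.0, None)
    return scores / (scores.sum() + 1e-12)

def combined_prob(p_ngram, p_lsa, lam=0.5):
    # Prob(wq | Hq-1) approximated as lam * Prob(wq | hq-1) + (1 - lam) * Prob(wq | dq-1)
    return lam * p_ngram + (1.0 - lam) * p_lsa
```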

Probabilistic Latent Semantic Analysis (PLSA)
Di: documents, Tk: latent topics, tj: terms
Exactly the same goal as LSA: using a set of latent topics {Tk, k = 1, ..., K} to construct a new relationship between the documents and terms, but within a probabilistic framework
P(tj | Di) = Σk P(tj | Tk) P(Tk | Di)
Trained with EM by maximizing the total likelihood LT = Σi Σj n(Di, tj) log P(tj | Di)
n(Di, tj): frequency count of term tj in document Di

Probabilistic Latent Semantic Analysis (PLSA)
(graphical model: d → z → w, with parameters P(z | d) and P(w | z))
w: word, z: topic, d: document
N: number of words in document d, M: number of documents in the corpus
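
A minimal sketch of PLSA training with EM under the notation above (P(tj | Tk), P(Tk | Di), counts n(Di, tj)); tempering, stopping criteria, and the dense D×K×W posterior array are simplifications for illustration.

```python
# Minimal sketch of PLSA trained with EM. n is the D x W term-count matrix.
import numpy as np

def plsa_em(n, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    D, W = n.shape
    Pw_z = rng.random((K, W)); Pw_z /= Pw_z.sum(axis=1, keepdims=True)   # P(t_j | T_k)
    Pz_d = rng.random((D, K)); Pz_d /= Pz_d.sum(axis=1, keepdims=True)   # P(T_k | D_i)
    for _ in range(n_iter):
        # E-step: P(T_k | D_i, t_j) proportional to P(t_j | T_k) P(T_k | D_i)
        post = Pz_d[:, :, None] * Pw_z[None, :, :]           # D x K x W
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from expected counts
        weighted = n[:, None, :] * post                      # D x K x W
        Pw_z = weighted.sum(axis=0)
        Pw_z /= Pw_z.sum(axis=1, keepdims=True) + 1e-12
        Pz_d = weighted.sum(axis=2)
        Pz_d /= Pz_d.sum(axis=1, keepdims=True) + 1e-12
    return Pw_z, Pz_d
```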

Latent Dirichlet Allocation (LDA)
A document is represented as a random mixture over latent topics; each topic is characterized by a distribution over words
P(θm | α): Dirichlet distribution; θm: topic distribution for document m (α: prior for θm)
zm,n ~ P(zm,n | θm): topic assigned to word wm,n
P(φk | β): Dirichlet distribution; φk: word distribution for topic k (β: prior for φk)
wm,n ~ P(wm,n | zm,n, φk): the n-th word in document m
n: word index, a total of Nm words in document m
m: document index, a total of M documents in the corpus
k: topic index, a total of K topics
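
The following sketch (an illustration only, with toy values for K, V, Nm, α, β) draws one document from the generative story above using numpy.

```python
# Minimal sketch of LDA's generative process for a single document m.
import numpy as np

rng = np.random.default_rng(0)
K, V, N_m = 3, 20, 15                 # topics, vocabulary size, words in document m
alpha, beta = 0.5, 0.1

phi = rng.dirichlet(np.full(V, beta), size=K)   # phi_k: word distribution per topic
theta_m = rng.dirichlet(np.full(K, alpha))      # theta_m: topic distribution for doc m

doc = []
for n in range(N_m):
    z_mn = rng.choice(K, p=theta_m)             # z_{m,n} ~ P(z | theta_m)
    w_mn = rng.choice(V, p=phi[z_mn])           # w_{m,n} ~ P(w | z_{m,n}, phi)
    doc.append(w_mn)
```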

Gibbs Sampling in general
To obtain samples from a distribution of a given form with unknown parameters, over variables zi, i = 1, ..., M:
Initialize zi^(0), i = 1, ..., M
For τ = 0, ..., T:
  Sample z1^(τ+1) ~ p(z1 | z2^(τ), z3^(τ), ..., zM^(τ)), i.e. take a sample of z1 based on this conditional distribution
  Sample z2^(τ+1) ~ p(z2 | z1^(τ+1), z3^(τ), ..., zM^(τ))
  ...
  Sample zj^(τ+1) ~ p(zj | z1^(τ+1), ..., zj−1^(τ+1), zj+1^(τ), ..., zM^(τ))
  ...
  Sample zM^(τ+1) ~ p(zM | z1^(τ+1), z2^(τ+1), ..., zM−1^(τ+1))
Apply Markov Chain Monte Carlo: sample each variable sequentially, conditioned on the current values of the other variables, until the distribution converges, then estimate the parameters from the converged samples (see the toy example below)
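
A toy instance of the scheme above, assuming a concrete target for which both full conditionals are known in closed form (a 2-D Gaussian with unit variances and correlation ρ); it is meant only to show the sequential conditional sampling, not anything LDA-specific.

```python
# Minimal sketch of Gibbs sampling for a bivariate Gaussian target:
# z1 | z2 ~ N(rho * z2, 1 - rho^2), and symmetrically for z2 | z1.
import numpy as np

def gibbs_bivariate_gaussian(T=5000, rho=0.8, seed=0):
    rng = np.random.default_rng(seed)
    z1, z2 = 0.0, 0.0
    samples = []
    for _ in range(T):
        z1 = rng.normal(rho * z2, np.sqrt(1 - rho ** 2))   # sample z1 | z2
        z2 = rng.normal(rho * z1, np.sqrt(1 - rho ** 2))   # sample z2 | z1
        samples.append((z1, z2))
    return np.array(samples)
```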

Gibbs Sampling applied on LDA
Sample from P(Z, W)
(figure: the words w11, w12, ..., w1n of Doc 1, w21, ..., w2n of Doc 2, ..., each with its topic assignment still unknown)

Gibbs Sampling applied on LDA
Sample from P(Z, W)
Random initialization: each word is first given a random topic assignment
(figure: the same word grid with randomly initialized topics)

Gibbs Sampling applied on LDA
Sample from P(Z, W)
Random initialization
Erase z11, and draw a new z11 ~ P(z11 | z12, ..., zM,NM, w11, w12, ..., wM,NM)

Gibbs Sampling applied on LDA
Sample from P(Z, W)
Random initialization
Erase z11, and draw a new z11 ~ P(z11 | z12, ..., zM,NM, w11, w12, ..., wM,NM)
Erase z12, and draw a new z12 ~ P(z12 | z11, z13, ..., zM,NM, w11, w12, ..., wM,NM)

Gibbs Sampling applied on LDA
Sample from P(Z, W)
Random initialization
Erase z11, and draw a new z11 ~ P(z11 | z12, ..., zM,NM, w11, w12, ..., wM,NM)
Erase z12, and draw a new z12 ~ P(z12 | z11, z13, ..., zM,NM, w11, w12, ..., wM,NM)
Iteratively update the topic assignment of each word until convergence
Compute θ and φ according to the final topic assignments
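
A compact sketch of collapsed Gibbs sampling for LDA in the spirit of the slides above; the count-ratio sampling formula is the standard one, while the data format (docs as lists of word ids) and the hyperparameter defaults are assumptions.

```python
# Hypothetical sketch of collapsed Gibbs sampling for LDA: each z_{m,n} is
# resampled conditioned on all other assignments, with
# p(z = k | ...) proportional to (n_mk + alpha) * (n_kw + beta) / (n_k + V*beta).
import numpy as np

def lda_gibbs(docs, V, K=10, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n_mk = np.zeros((len(docs), K))          # topic counts per document
    n_kw = np.zeros((K, V))                  # word counts per topic
    n_k = np.zeros(K)                        # total words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]   # random initialization
    for m, d in enumerate(docs):
        for n, w in enumerate(d):
            k = z[m][n]
            n_mk[m, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iter):
        for m, d in enumerate(docs):
            for n, w in enumerate(d):
                k = z[m][n]                  # "erase" the current assignment
                n_mk[m, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                p = (n_mk[m] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())        # draw a new topic
                z[m][n] = k
                n_mk[m, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    theta = (n_mk + alpha) / (n_mk.sum(axis=1, keepdims=True) + K * alpha)
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + V * beta)
    return theta, phi, z
```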

Matrix Factorization (MF) for Recommendation Systems
R = [r_ui]: rating matrix, u: user, i: item
(example: a sparse 7-user × 9-movie rating matrix; User A rated two movies 3.7 and 4.0, User B one movie 4.3, User C one movie 4.1, User D two movies 2.3 and 2.5, User E one movie 3.3, User F one movie 2.9, User G two movies 2.6 and 2.7; most entries are missing)

Matrix Factorization (MF)
Mapping both users and items to a joint latent factor space of dimensionality f
latent factors: e.g. geared toward male viewers, seriousness, etc.
r_ui ≈ q_i^T p_u, where p_u is the f-dim latent factor vector of user u and q_i is that of item i
(figure: the U × I rating matrix approximated by the product of a U × f user-factor matrix and an f × I item-factor matrix)

Matrix Factorization (MF)
Objective function: min_{q,p} Σ_{(u,i) observed} (r_ui − q_i^T p_u)^2 + λ(‖q_i‖^2 + ‖p_u‖^2)
Training
gradient descent (GD)
alternating least squares (ALS): alternately fix the p_u's or the q_i's and compute the others as a least-squares problem
Different from SVD (LSA): SVD assumes the missing entries to be zero (a poor assumption), while MF fits only the observed ratings (see the sketch below)
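
A minimal sketch of training the objective above with stochastic gradient descent over the observed ratings (ALS would instead alternate closed-form least-squares solves); the data format and hyperparameters are illustrative.

```python
# Minimal sketch: MF trained by SGD. ratings is a list of (u, i, r_ui) triples.
import numpy as np

def mf_sgd(ratings, n_users, n_items, f=20, lam=0.1, lr=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, f))    # p_u, one row per user
    Q = 0.1 * rng.standard_normal((n_items, f))    # q_i, one row per item
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - Q[i] @ P[u]                  # residual on an observed rating
            P[u] += lr * (err * Q[i] - lam * P[u])
            Q[i] += lr * (err * P[u] - lam * Q[i])
    return P, Q

# predicted rating for user u and item i: r_hat = Q[i] @ P[u]
```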

Overfitting Problem
A good model should not just fit all the training data; it needs to cover unseen data well, whose distribution may differ slightly from that of the training data
a model that is too complicated, with too many parameters, usually leads to overfitting

Extensions of Matrix Factorization (MF)
Biased MF: add a global bias μ (usually the average rating), a user bias b_u, and an item bias b_i as parameters, predicting r̂_ui = μ + b_u + b_i + q_i^T p_u (see the sketch below)
Non-negative Matrix Factorization (NMF): restrict each component of p_u and q_i to be non-negative
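
A hypothetical sketch of one biased-MF update step under the prediction rule r̂_ui = μ + b_u + b_i + q_i^T p_u; regularizing the bias terms the same way as the factors is a common choice, not necessarily the slide's.

```python
# Hypothetical sketch: a single SGD step for biased MF on one observed rating.
import numpy as np

def biased_mf_step(P, Q, b_u, b_i, mu, u, i, r, lam=0.1, lr=0.01):
    err = r - (mu + b_u[u] + b_i[i] + Q[i] @ P[u])   # residual under biased prediction
    b_u[u] += lr * (err - lam * b_u[u])
    b_i[i] += lr * (err - lam * b_i[i])
    P[u] += lr * (err * Q[i] - lam * P[u])
    Q[i] += lr * (err * P[u] - lam * Q[i])
```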

References
LSA and PLSA
"Exploiting Latent Semantic Information in Statistical Language Modeling", Proceedings of the IEEE, Aug. 2000
"Latent Semantic Mapping", IEEE Signal Processing Magazine, Sept. 2005, Special Issue on Speech Technology in Human-Machine Communication
"Probabilistic Latent Semantic Indexing", ACM Special Interest Group on Information Retrieval (ACM SIGIR), 1999
"Probabilistic Latent Semantic Analysis", Proc. of Uncertainty in Artificial Intelligence (UAI), 1999
"Spoken Document Understanding and Organization", IEEE Signal Processing Magazine, Sept. 2005, Special Issue on Speech Technology in Human-Machine Communication
LDA and Gibbs Sampling
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006
David M. Blei, Andrew Y. Ng, Michael I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, 2003
Gregor Heinrich, "Parameter Estimation for Text Analysis", 2005

References
Matrix Factorization
"A Linear Ensemble of Individual and Blended Models for Music Rating Prediction", JMLR W&CP, volume 18, 2011
Y. Koren, R. M. Bell, and C. Volinsky, "Matrix Factorization Techniques for Recommender Systems", Computer, 42(8):30-37, 2009
"Introduction to Matrix Factorization Methods Collaborative Filtering" (http://www.intelligentmining.com/knowledge/slides/Collaborative.Filtering.Factorization.pdf)
GraphLab API: Collaborative Filtering (http://docs.graphlab.org/collaborative_filtering.html)
J. Mairal, F. Bach, J. Ponce, G. Sapiro, "Online Learning for Matrix Factorization and Sparse Coding", Journal of Machine Learning Research, 2010