CS246: Latent Dirichlet Allocation
Junghoo “John” Cho, UCLA
LSI
- LSI uses SVD to find the best rank-K approximation.
- The result is difficult to interpret, especially with negative numbers.
- Q: Can we develop a more interpretable method?
Probabilistic Approach
- Develop a probabilistic model of how users write a document based on topics.
- Q: How do we write a document?
- A: (1) Pick the topic(s). (2) Start writing on the topic(s) with related terms.
Two Probability Vectors
- For every document d, we assume that the user first picks the topics to write about.
  - P(z|d): the probability of picking topic z when the user writes each word of document d.
  - $\sum_{z=1}^{T} P(z|d) = 1$
  - This is the document-topic vector of d.
- We also assume that every topic is associated with certain words with certain probabilities.
  - P(w|z): the probability of picking the word w when the user writes on topic z.
  - $\sum_{w=1}^{W} P(w|z) = 1$
  - This is the topic-word vector of z.
Probabilistic Topic Model
- There exist T topics.
- The topic-word vector of each topic is set before any document is written: P(w|z) is set for every z and w.
- Then, for every document d:
  - The user decides the topics to write on, i.e., P(z|d).
  - For each word in d:
    - The user selects a topic z with probability P(z|d).
    - The user selects a word w with probability P(w|z).
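A minimal sketch of this two-step process, assuming numpy; the vocabulary, topics, and probability values below are made up for illustration, not taken from the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    vocab = ["bank", "loan", "money", "river", "stream"]   # illustrative vocabulary
    p_w_given_z = np.array([                               # P(w|z), one row per topic
        [0.3, 0.3, 0.4, 0.0, 0.0],                         # topic 0: finance-like words
        [0.3, 0.0, 0.0, 0.4, 0.3],                         # topic 1: nature-like words
    ])
    p_z_given_d = np.array([0.9, 0.1])                     # P(z|d) for one document

    def generate_document(n_words):
        words = []
        for _ in range(n_words):
            z = rng.choice(len(p_z_given_d), p=p_z_given_d)   # pick a topic with P(z|d)
            w = rng.choice(len(vocab), p=p_w_given_z[z])      # pick a word with P(w|z)
            words.append(vocab[w])
        return words

    print(generate_document(10))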
Probabilistic Document Model
[Figure: two topics and three documents. Topic 1 = {bank, loan, money}, Topic 2 = {river, stream, bank}. DOC 1 is written entirely from Topic 1 (P(z|d) = 1.0), DOC 2 is an even 0.5/0.5 mixture of the two topics, and DOC 3 is written entirely from Topic 2. The superscript on each word (e.g., bank^1, river^2) marks the topic that generated it.]
Example: Calculating Probability
- z1 = {w1: 0.8, w2: 0.1, w3: 0.1}
- z2 = {w1: 0.1, w2: 0.2, w3: 0.7}
- d's topics are {z1: 0.9, z2: 0.1}.
- d has three terms {w3^2, w1^1, w2^1}, where the superscript marks the topic each word was generated from.
- Q: What is the probability that a user will write such a document?
- A: (0.1 * 0.7) * (0.9 * 0.8) * (0.9 * 0.1)
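The same arithmetic as a quick check (values copied from this slide): each word contributes P(z|d) * P(w|z) for its assigned topic.

    # P(w|z) for topics z1, z2 over words w1, w2, w3 (from the slide)
    p_w_given_z = {"z1": {"w1": 0.8, "w2": 0.1, "w3": 0.1},
                   "z2": {"w1": 0.1, "w2": 0.2, "w3": 0.7}}
    p_z_given_d = {"z1": 0.9, "z2": 0.1}                   # document-topic vector of d

    # d = (w3 from z2, w1 from z1, w2 from z1)
    doc = [("w3", "z2"), ("w1", "z1"), ("w2", "z1")]

    prob = 1.0
    for w, z in doc:
        prob *= p_z_given_d[z] * p_w_given_z[z][w]         # P(z|d) * P(w|z) per word
    print(prob)   # (0.1*0.7) * (0.9*0.8) * (0.9*0.1) = 0.004536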
Corpus Generation Probability
- T: # topics, D: # documents, M: # words per document
- Probability of generating the corpus C:
  $P(C) = \prod_{i=1}^{D} \prod_{j=1}^{M} P(w_{i,j} \mid z_{i,j})\, P(z_{i,j} \mid d_i)$
  where $w_{i,j}$ is the j-th word of document $d_i$ and $z_{i,j}$ is the topic assigned to that word.
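In code this is just a product over every word in every document. The sketch below is my own; it assumes each document is stored as a list of (word, assigned topic) pairs as in the earlier example, and uses log-probabilities to avoid numerical underflow on long documents.

    import math

    def corpus_log_prob(corpus, p_w_given_z, p_z_given_d):
        """corpus[i]: list of (word, topic) pairs for document i (the w_ij and z_ij above).
        p_w_given_z[z][w]: P(w|z);  p_z_given_d[i][z]: P(z|d_i)."""
        log_p = 0.0
        for i, doc in enumerate(corpus):
            for w, z in doc:
                log_p += math.log(p_w_given_z[z][w]) + math.log(p_z_given_d[i][z])
        return log_p

For a small corpus, math.exp(corpus_log_prob(...)) recovers P(C) itself.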
Generative Model vs Inference (1)
[Figure: the document-generation diagram again. P(w|z) and P(z|d) are known, and the words of DOC 1-3 are generated from them (bank, loan, money from Topic 1; river, stream, bank from Topic 2).]
Generative Model vs Inference (2)
[Figure: the same diagram with every P(w|z), P(z|d), and topic assignment replaced by "?". We only observe the documents (money, bank, loan, river, stream, ...); the topic behind each word and both probability vectors must be inferred.]
Probabilistic Latent Semantic Indexing (pLSI)
- Basic idea: pick the P(z_j|d_i), P(w_k|z_j), and z_{i,j} values that maximize the corpus generation probability.
- This is maximum-likelihood estimation (MLE).
- More discussion later on how to compute the P(z_j|d_i), P(w_k|z_j), and z_{i,j} values that maximize the probability.
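How those maximizing values are computed is deferred to a later lecture; as a rough preview, one standard approach is the EM algorithm. The sketch below is my own illustrative implementation, not the method developed in class; it assumes the corpus is given as a D x W document-word count matrix and that every word occurs at least once, and it uses dense arrays that are only practical for toy data.

    import numpy as np

    def plsi_em(n, T, iters=100, seed=0):
        """n: (D, W) document-word count matrix. Returns P(z|d) (D, T) and P(w|z) (T, W)."""
        rng = np.random.default_rng(seed)
        D, W = n.shape
        p_z_d = rng.random((D, T)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        p_w_z = rng.random((T, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        for _ in range(iters):
            # E-step: P(z|d,w) proportional to P(w|z) * P(z|d), shape (D, W, T)
            post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
            post /= post.sum(axis=2, keepdims=True)
            # M-step: re-estimate both distributions from expected topic counts
            weighted = n[:, :, None] * post                # expected counts per (d, w, z)
            p_w_z = weighted.sum(axis=0).T                 # (T, W)
            p_w_z /= p_w_z.sum(axis=1, keepdims=True)
            p_z_d = weighted.sum(axis=1)                   # (D, T)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        return p_z_d, p_w_z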
Problem of pLSI
- Q: 1M documents, 1000 topics, 1M words, 1000 words/doc. How much input data? How many variables do we have to estimate?
- Q: Too much freedom. How can we avoid the overfitting problem?
- A: Add constraints to reduce the degrees of freedom.
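One way to run the numbers above (my own back-of-the-envelope, reading "1M words" as the vocabulary size):

    D, T, W, L = 10**6, 10**3, 10**6, 10**3    # documents, topics, vocabulary size, words per document

    input_words = D * L                        # observed data: 10^9 word occurrences
    p_z_d_params = D * T                       # one P(z|d) entry per (document, topic): 10^9
    p_w_z_params = T * W                       # one P(w|z) entry per (topic, word): 10^9
    print(input_words, p_z_d_params + p_w_z_params)
    # Roughly as many free parameters as observed words, which is why overfitting is a real concern.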
Latent Dirichlet Allocation (LDA)
- When term probabilities are selected for each topic:
  - The topic-term probability vector (P(w1|z_j), ..., P(wW|z_j)) is sampled randomly from a Dirichlet distribution.
- When users select topics for a document:
  - The document-topic probability vector (P(z1|d), ..., P(zT|d)) is sampled randomly from a Dirichlet distribution.
What is the Dirichlet Distribution?
- Multinomial distribution
  - Given the probability p_i of each event e_i, what is the probability that each event e_i occurs ⍺_i times after n trials?
  - We assume the p_i's; the distribution assigns a probability to the ⍺_i's.
- Dirichlet distribution
  - The "inverse" of the multinomial distribution: we assume the ⍺_i's; the distribution assigns a probability to the p_i's.
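numpy can draw such probability vectors directly; a small illustration (the ⍺ values here are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    alpha = [2.0, 3.0, 5.0]                    # Dirichlet parameters (illustrative)
    p = rng.dirichlet(alpha)                   # one sample: a probability vector
    print(p, p.sum())                          # entries are nonnegative and sum to 1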
Dirichlet Distribution
- Q: Given ⍺1, ⍺2, ..., ⍺k, what are the most likely p1, p2, ..., pk values?
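The slide leaves this as a question; for reference, under the standard Dirichlet density introduced a few slides below, and assuming every ⍺_i > 1, the density is maximized at the mode shown here (the mean is given alongside for comparison):

    % Mode and mean of Dir(\alpha_1, ..., \alpha_k); the mode formula is valid when every \alpha_i > 1
    \hat{p}_i = \frac{\alpha_i - 1}{\sum_{j=1}^{k} \alpha_j - k},
    \qquad
    \mathbb{E}[p_i] = \frac{\alpha_i}{\sum_{j=1}^{k} \alpha_j}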
Normalized Probability Vector and Simplex Plane
- When $(p_1, \dots, p_n)$ satisfies $p_1 + \dots + p_n = 1$, the points lie on an "(n-1)-simplex plane".
- Remember that $\sum_{z=1}^{T} P(z|d) = 1$ and $\sum_{w=1}^{W} P(w|z) = 1$.
- Example: $(p_1, p_2, p_3)$ and their 2-simplex plane, the triangle cut out of the positive octant by the plane $p_1 + p_2 + p_3 = 1$.
Effect of ⍺ values
[Figure: Dirichlet densities over the 2-simplex of (p1, p2, p3), shown for several different ⍺ settings.]
Minor Correction
- The formula used on the previous slides is not a standard Dirichlet distribution.
- The "standard" Dirichlet distribution formula:
  $\mathrm{Dir}(p_1, \dots, p_k; \alpha_1, \dots, \alpha_k) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} p_i^{\alpha_i - 1}$
- I used the non-standard version to make the connection to the multinomial distribution clear.
- From now on, we use the standard formula.
Back to LDA: Document Generation Model
- For each topic z:
  - Pick the word-probability vector P(w|z) by taking a random sample from Dir(β1, ..., βW).
- For every document d:
  - The user decides its topic vector P(z|d) by taking a random sample from Dir(⍺1, ..., ⍺T).
  - For each word in d:
    - The user selects a topic z with probability P(z|d).
    - The user selects a word w with probability P(w|z).
- Once all is said and done, we have:
  - P(w|z): the topic-term vector of each topic
  - P(z|d): the document-topic vector of each document
  - A topic assignment for every word in each document
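A compact sketch of this whole generative story, assuming numpy; the topic count, vocabulary size, and hyperparameter values are illustrative only:

    import numpy as np

    rng = np.random.default_rng(0)
    T, W, D, N = 2, 5, 3, 10                    # topics, vocabulary size, documents, words per document
    beta = np.full(W, 0.1)                      # Dir(beta_1, ..., beta_W) for topic-word vectors
    alpha = np.full(T, 0.5)                     # Dir(alpha_1, ..., alpha_T) for document-topic vectors

    p_w_given_z = rng.dirichlet(beta, size=T)   # one P(w|z) vector per topic, shape (T, W)

    corpus, assignments = [], []
    for _ in range(D):
        p_z_given_d = rng.dirichlet(alpha)      # this document's topic vector P(z|d)
        words, topics = [], []
        for _ in range(N):
            z = rng.choice(T, p=p_z_given_d)    # pick a topic with P(z|d)
            w = rng.choice(W, p=p_w_given_z[z]) # pick a word with P(w|z)
            words.append(w); topics.append(z)
        corpus.append(words); assignments.append(topics)

At the end, p_w_given_z, the per-document p_z_given_d samples, and assignments are exactly the three quantities listed above.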
Symmetric Dirichlet Distribution
- In principle, we need to assume two vectors, (⍺1, ..., ⍺T) and (β1, ..., βW), as input parameters.
- In practice, we often assume all ⍺i's are equal to ⍺ and all βi's are equal to β.
  - Use two scalar values ⍺ and β, not two vectors.
  - This is the symmetric Dirichlet distribution.
- Q: What is the implication of this assumption?
Effect of the ⍺ value on the Symmetric Dirichlet
- Q: What does it mean? How will the sampled document-topic vectors change as ⍺ grows?
- Common choice: ⍺ = 50/T, β = 200/W
[Figure: symmetric Dirichlet densities over the 2-simplex of (p1, p2, p3) for different ⍺ values.]
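One quick way to see the effect (my own illustration, not from the slides): draw topic vectors from a symmetric Dirichlet with small, moderate, and large ⍺ and compare how peaked they are.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 5
    for alpha in (0.1, 1.0, 10.0):
        samples = rng.dirichlet(np.full(T, alpha), size=3)   # three sampled P(z|d) vectors
        print(alpha)
        print(np.round(samples, 2))
    # Small alpha: most of the mass lands on one or two topics (sparse, peaked vectors).
    # Large alpha: the mass spreads nearly evenly across all topics.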
Plate Notation
[Figure: the LDA plate diagram. ⍺ feeds the document-topic vector P(z|d) and β feeds the topic-word vector P(w|z); z and w sit inside the per-word plate, P(z|d) inside the per-document plate, and P(w|z) inside the per-topic plate (plate sizes M, T, and N in the figure).]