CS246: Latent Dirichlet Analysis
Junghoo "John" Cho, UCLA

LSI
LSI uses SVD to find the best rank-K approximation of the term-document matrix.
The result is difficult to interpret, especially with negative numbers.
Q: Can we develop a more interpretable method?
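
A minimal sketch of the point above (my own illustration, not from the slides): a toy term-document matrix, its best rank-K approximation via truncated SVD, and the negative entries that make the result hard to interpret.

```python
# Toy term-document matrix: rows = terms, columns = documents (made-up counts).
import numpy as np

A = np.array([
    [2., 0., 1., 0.],   # "bank"
    [1., 0., 2., 0.],   # "money"
    [0., 3., 0., 1.],   # "river"
    [0., 1., 0., 2.],   # "stream"
])

K = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]   # best rank-K approximation

print(np.round(A_k, 2))   # some entries come out negative, hence hard to interpret
```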

Probabilistic Approach
Develop a probabilistic model of how users write a document based on topics.
Q: How do we write a document?
A: (1) Pick the topic(s). (2) Start writing on the topic(s) with related terms.

Two Probability Vectors
For every document d, we assume that the user first picks the topics to write about:
P(z|d): the probability of picking topic z when the user writes each word of document d; ∑_{z=1}^{T} P(z|d) = 1. This is the document-topic vector of d.
We also assume that every topic is associated with certain words with certain probabilities:
P(w|z): the probability of picking word w when the user writes on topic z; ∑_{w=1}^{W} P(w|z) = 1. This is the topic-word vector of z.
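
A small sketch of the two kinds of vectors (the numbers are the ones used in the example a few slides below; the variable names are my own):

```python
# Document-topic vector P(z|d) and topic-word vectors P(w|z), each normalized.
import numpy as np

# P(z|d) for one document d, over T = 2 topics.
p_z_given_d = np.array([0.9, 0.1])

# P(w|z), one row per topic, over W = 3 words.
p_w_given_z = np.array([
    [0.8, 0.1, 0.1],   # topic z1
    [0.1, 0.2, 0.7],   # topic z2
])

assert np.isclose(p_z_given_d.sum(), 1.0)        # sum over z equals 1
assert np.allclose(p_w_given_z.sum(axis=1), 1.0)  # each topic row sums to 1
```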

Probabilistic Topic Model
There exist T topics.
The topic-word vector of each topic is set before any document is written: P(w|z) is set for every z and w.
Then, for every document d:
The user decides the topics to write on, i.e., P(z|d).
For each word in d:
The user selects a topic z with probability P(z|d).
The user selects a word w with probability P(w|z).
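
A minimal sketch of this generative story (my own illustration; the vocabulary, topic-word probabilities, and document lengths are made-up values, and P(z|d) and P(w|z) are fixed by hand rather than learned):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["bank", "loan", "money", "river", "stream"]
p_w_given_z = np.array([
    [0.3, 0.3, 0.4, 0.0, 0.0],   # topic 0: finance words
    [0.3, 0.0, 0.0, 0.4, 0.3],   # topic 1: river words
])

def write_document(p_z_given_d, n_words=8):
    words = []
    for _ in range(n_words):
        z = rng.choice(len(p_z_given_d), p=p_z_given_d)   # pick a topic from P(z|d)
        w = rng.choice(len(vocab), p=p_w_given_z[z])       # pick a word from P(w|z)
        words.append(vocab[w])
    return words

print(write_document(np.array([1.0, 0.0])))   # a pure "finance" document
print(write_document(np.array([0.5, 0.5])))   # a document mixing both topics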

Probabilistic Document Model
[Diagram: Topic 1 = {bank, loan, money}, Topic 2 = {river, stream, bank}. DOC 1 is written entirely from Topic 1 (P(z|d) = 1.0): money¹ bank¹ loan¹ bank¹ money¹ ... DOC 2 mixes the two topics equally (0.5 each): money¹ river² bank¹ stream² bank² ... DOC 3 is written entirely from Topic 2 (1.0): river² stream² river² bank² stream² ... The superscript on each word marks the topic that generated it.]

Example: Calculating Probability
z1 = {w1: 0.8, w2: 0.1, w3: 0.1}, z2 = {w1: 0.1, w2: 0.2, w3: 0.7}
d's topics are {z1: 0.9, z2: 0.1}. d has three words: w3 (generated from z2), w1 (from z1), w2 (from z1).
Q: What is the probability that a user will write such a document?
A: (0.1 × 0.7) × (0.9 × 0.8) × (0.9 × 0.1) ≈ 0.0045
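
The same computation, worked term by term (a small sketch using the slide's numbers; the variable names are my own):

```python
p_w_given_z = {
    "z1": {"w1": 0.8, "w2": 0.1, "w3": 0.1},
    "z2": {"w1": 0.1, "w2": 0.2, "w3": 0.7},
}
p_z_given_d = {"z1": 0.9, "z2": 0.1}

# The document: each word together with the topic that generated it.
doc = [("w3", "z2"), ("w1", "z1"), ("w2", "z1")]

prob = 1.0
for w, z in doc:
    prob *= p_z_given_d[z] * p_w_given_z[z][w]   # P(z|d) * P(w|z) per word

print(prob)   # ≈ 0.0045  ( = 0.1*0.7 * 0.9*0.8 * 0.9*0.1 )
```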

Corpus Generation Probability
T: # topics, D: # documents, M: # words per document
Probability of generating the corpus C:
P(C) = ∏_{i=1}^{D} ∏_{j=1}^{M} P(w_{i,j} | z_{i,j}) P(z_{i,j} | d_i)
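
A minimal sketch of computing P(C) (my own illustration with a made-up two-document corpus; done in log space, since on a real corpus the product underflows):

```python
import math

p_w_given_z = {
    "z1": {"w1": 0.8, "w2": 0.1, "w3": 0.1},
    "z2": {"w1": 0.1, "w2": 0.2, "w3": 0.7},
}

# corpus[i] = (P(z|d_i), list of (word, topic assignment) pairs for d_i)
corpus = [
    ({"z1": 0.9, "z2": 0.1}, [("w3", "z2"), ("w1", "z1"), ("w2", "z1")]),
    ({"z1": 0.2, "z2": 0.8}, [("w3", "z2"), ("w3", "z2"), ("w2", "z1")]),
]

log_p = 0.0
for p_z_given_d, words in corpus:
    for w, z in words:
        log_p += math.log(p_w_given_z[z][w]) + math.log(p_z_given_d[z])

print(math.exp(log_p))   # P(C) for this tiny corpus
```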

Generative Model vs Inference (1)
[Diagram: the same picture as the previous slide. In the generative view, the topic-word vectors P(w|z), the document-topic mixtures P(z|d), and the topic assignment of every word are all known.]

Generative Model vs Inference (2)
[Diagram: the same three documents, but only the words are observed. The topics, the mixtures P(z|d), and each word's topic assignment are all unknown, shown as "?". Inference means recovering them from the observed words.]

Probabilistic Latent Semantic Indexing (pLSI)
Basic idea: pick the P(z_j|d_i), P(w_k|z_j), and z_ij values that maximize the corpus generation probability.
This is maximum-likelihood estimation (MLE).
More discussion later on how to compute the P(z_j|d_i), P(w_k|z_j), and z_ij values that maximize the probability.
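
The slides defer the estimation procedure. As a hedged sketch only, here is the standard EM algorithm for pLSI, which marginalizes over the per-word topic assignments rather than picking hard z_ij values; the count matrix, sizes, and iteration count are made-up illustration values, and this is not necessarily the method discussed later in the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.integers(1, 5, size=(6, 10)).astype(float)   # toy doc-term counts: 6 docs, 10 words
D, W = n.shape
T = 2

p_z_d = rng.dirichlet(np.ones(T), size=D)    # P(z|d), shape (D, T)
p_w_z = rng.dirichlet(np.ones(W), size=T)    # P(w|z), shape (T, W)

for _ in range(100):
    # E-step: P(z|d,w) proportional to P(w|z) * P(z|d)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (D, T, W)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate P(w|z) and P(z|d) from expected counts
    expected = n[:, None, :] * post                        # shape (D, T, W)
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print(np.round(p_w_z, 2))   # learned topic-word vectors
```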

Problem of pLSI
Q: 1M documents, 1,000 topics, a 1M-word vocabulary, 1,000 words per document. How much input data do we have? How many variables do we have to estimate? (See the worked count below.)
Q: Too much freedom. How can we avoid the overfitting problem?
A: Add constraints to reduce the degrees of freedom.
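
A back-of-the-envelope count of the numbers in the question (my own worked arithmetic, not from the slide):

```python
D = 1_000_000   # documents
T = 1_000       # topics
W = 1_000_000   # vocabulary size
L = 1_000       # words per document

input_words = D * L                 # observed word occurrences: 1e9
params = D * T + T * W              # P(z|d) values + P(w|z) values: ~2e9
assignments = D * L                 # one latent topic per word occurrence: 1e9

print(input_words, params, assignments)
# Roughly as many free parameters as observed words, hence the overfitting risk.
```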

Latent Dirichlet Analysis (LDA)
When the term probabilities are selected for each topic: the topic-term probability vector (P(w1|zj), …, P(wW|zj)) is sampled randomly from a Dirichlet distribution.
When users select the topics for a document: the document-topic probability vector (P(z1|d), …, P(zT|d)) is sampled randomly from a Dirichlet distribution.

What is the Dirichlet Distribution?
Multinomial distribution: given the probability p_i of each event e_i, what is the probability that each event e_i occurs ⍺_i times after n trials? We assume the p_i's; the distribution assigns a probability to each outcome (⍺_1, …, ⍺_k).
Dirichlet distribution: the "inverse" of the multinomial distribution. We assume the ⍺_i's; the distribution assigns a probability to each probability vector (p_1, …, p_k).
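
A minimal sketch of the duality described above (illustrative numbers of my own choosing):

```python
import numpy as np

rng = np.random.default_rng(0)

# Multinomial: fix the event probabilities p_i, get random counts alpha_i.
p = [0.2, 0.3, 0.5]
counts = rng.multinomial(n=10, pvals=p)
print(counts)                        # counts over the 3 events after 10 trials

# Dirichlet: fix the concentration parameters alpha_i, get a random
# probability vector p_i (non-negative, summing to 1).
alpha = [2, 3, 5]
p_sampled = rng.dirichlet(alpha)
print(p_sampled, p_sampled.sum())    # a point on the 2-simplex
```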

Dirichlet Distribution
Q: Given ⍺1, ⍺2, …, ⍺k, what are the most likely p1, p2, …, pk values?
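
The slide leaves the question open; for reference, the mode of the Dirichlet distribution (the most likely probability vector, valid when every ⍺_i > 1) is:

```latex
% Mode of Dir(\alpha_1, \dots, \alpha_k), for \alpha_i > 1:
\hat{p}_i \;=\; \frac{\alpha_i - 1}{\sum_{j=1}^{k} \alpha_j \;-\; k}
```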

Normalized Probability Vector and Simplex Plane
When (p_1, …, p_n) satisfies p_1 + … + p_n = 1 (with each p_i ≥ 0), the point lies on an "(n−1)-simplex" plane.
Remember that ∑_{z=1}^{T} P(z|d) = 1 and ∑_{w=1}^{W} P(w|z) = 1.
Example: (p_1, p_2, p_3) and their 2-simplex plane. [Figure: the triangle with vertices at p_1 = 1, p_2 = 1, and p_3 = 1.]

Effect of ⍺ values
[Figure slides: scatter plots of sampled (p1, p2, p3) vectors on the 2-simplex for several different ⍺ settings.]

Minor Correction
The distribution I used earlier is not the standard Dirichlet distribution; the "standard" Dirichlet distribution formula is given below.
I used the non-standard form to make the connection to the multinomial distribution clear.
From now on, we use the standard formula.
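
The formula appeared as an image on the original slide; reconstructed here is the standard Dirichlet density:

```latex
% Standard Dirichlet density over (p_1, \dots, p_k), with p_i \ge 0 and \sum_i p_i = 1:
P(p_1, \dots, p_k \mid \alpha_1, \dots, \alpha_k)
  \;=\; \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}
             {\prod_{i=1}^{k}\Gamma(\alpha_i)}
        \;\prod_{i=1}^{k} p_i^{\,\alpha_i - 1}
```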

Back to LDA: Document Generation Model
For each topic z: pick the word probability vector P(w|z) by taking a random sample from Dir(β1, …, βW).
For every document d: the user decides its topic vector P(z|d) by taking a random sample from Dir(⍺1, …, ⍺T).
For each word in d: the user selects a topic z with probability P(z|d), then selects a word w with probability P(w|z).
Once all is said and done, we have:
P(w|z): the topic-term vector for each topic
P(z|d): the document-topic vector for each document
A topic assignment for every word in each document
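
A minimal sketch of this LDA generative process (my own illustration; T, W, D, the document length, and the symmetric ⍺ and β values are made-up):

```python
import numpy as np

rng = np.random.default_rng(0)
T, W, D, doc_len = 3, 20, 5, 15
alpha, beta = 0.5, 0.1                 # symmetric hyperparameters

# One topic-word vector P(w|z) per topic, sampled from Dir(beta, ..., beta).
p_w_given_z = rng.dirichlet(np.full(W, beta), size=T)

corpus, assignments = [], []
for _ in range(D):
    # Document-topic vector P(z|d), sampled from Dir(alpha, ..., alpha).
    p_z_given_d = rng.dirichlet(np.full(T, alpha))
    words, topics = [], []
    for _ in range(doc_len):
        z = rng.choice(T, p=p_z_given_d)        # pick a topic
        w = rng.choice(W, p=p_w_given_z[z])     # pick a word from that topic
        words.append(w)
        topics.append(z)
    corpus.append(words)
    assignments.append(topics)

print(corpus[0])        # word ids of the first generated document
print(assignments[0])   # the topic that generated each of those words
```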

Symmetric Dirichlet Distribution
In principle, we need two vectors, (⍺1, …, ⍺T) and (β1, …, βW), as input parameters.
In practice, we often assume all ⍺i's are equal to ⍺ and all βi's are equal to β, i.e., we use two scalar values ⍺ and β, not two vectors. This is the symmetric Dirichlet distribution.
Q: What is the implication of this assumption?

Effect of the ⍺ value on the Symmetric Dirichlet
Q: What does it mean? How will the sampled document-topic vectors change as ⍺ grows?
Common choice: ⍺ = 50/T, β = 200/W
[Figure: simplex plots of samples from the symmetric Dirichlet for a small and a large ⍺.]
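
A minimal sketch of the effect in question (my own illustration; T and the ⍺ values are made-up):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5

for alpha in [0.1, 1.0, 10.0]:
    samples = rng.dirichlet(np.full(T, alpha), size=3)
    print(f"alpha = {alpha}")
    print(np.round(samples, 2))
    # Small alpha: samples sit near the corners of the simplex, so each
    # document is dominated by a few topics.
    # Large alpha: samples cluster near the uniform vector (1/T, ..., 1/T).
```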

Plate Notation
[Diagram: ⍺ → P(z|d) → z → w ← P(w|z) ← β. The word w and its topic z sit inside a plate repeated N times (words per document), which sits inside a plate repeated M times (documents); the P(w|z) node sits inside a plate repeated T times (topics).]