Integrating Topics and Syntax - Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Presented by Han Liu, Department of Computer Science, University of Illinois at Urbana-Champaign, April 12, 2005
Outline
- Motivations: syntactic vs. semantic modeling
- Formalization: notation and terminology
- Generative models: pLSI, Latent Dirichlet Allocation
- Composite model: HMMs + LDA
- Inference: MCMC (Metropolis-Hastings, Gibbs sampling)
- Experiments: performance and evaluation
- Summary: Bayesian hierarchical models
- Discussion
Motivations
Statistical language modeling
- Syntactic dependencies: short-range
- Semantic dependencies: long-range
Current models capture only one aspect:
- Hidden Markov Models (HMMs): syntactic modeling
- Latent Dirichlet Allocation (LDA): semantic modeling
- Probabilistic Latent Semantic Indexing (pLSI): semantic modeling
A model that captures both kinds of dependencies may be more useful!
Problem Formalization
Word - A word is an item from a vocabulary indexed by {1, ..., V}, represented as a unit-basis vector: the vth word is a V-vector w in which the vth element is 1 and all other elements are 0 (a small example in Python follows).
Document - A document is a sequence of N words denoted by w = (w1, w2, ..., wN), where wi is the ith word in the sequence.
Corpus - A corpus is a collection of M documents, denoted by D = {w1, w2, ..., wM}.
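A minimal sketch of this representation in Python; the toy vocabulary and document below are invented for illustration:

    import numpy as np

    # Toy vocabulary, V = 5 (illustrative).
    vocab = ["topic", "syntax", "model", "word", "the"]
    V = len(vocab)

    def one_hot(word):
        """Return the V-dimensional unit-basis vector for a word."""
        w = np.zeros(V, dtype=int)
        w[vocab.index(word)] = 1
        return w

    # A document is a sequence of N word vectors.
    document = [one_hot(w) for w in ["the", "model", "the", "syntax"]]

    # A corpus is a collection of M documents.
    corpus = [document]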
Latent Semantic Structure
[Diagram: a latent structure generates a distribution over words, which produces the observed words; from the words we infer the latent structure, which in turn supports prediction.]
Probabilistic Generative Models
- Probabilistic Latent Semantic Indexing (pLSI): Hofmann (1999), ACM SIGIR - probabilistic semantic model
- Latent Dirichlet Allocation (LDA): Blei, Ng, & Jordan (2003), J. of Machine Learning Res. - probabilistic semantic model
- Hidden Markov Models (HMMs): Baum & Petrie (1966), Ann. Math. Stat. - probabilistic syntactic model
Dirichlet vs. Multinomial Distributions
- Dirichlet distribution (conjugate prior of the multinomial)
- Multinomial distribution
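The two densities referred to on this slide, and the conjugacy relation between them (a standard statement; the notation is chosen here, not taken from the slide):

    \[
    \mathrm{Dir}(\theta \mid \alpha_1,\dots,\alpha_T)
      = \frac{\Gamma\!\left(\sum_{t=1}^{T}\alpha_t\right)}{\prod_{t=1}^{T}\Gamma(\alpha_t)}
        \prod_{t=1}^{T} \theta_t^{\alpha_t - 1}
    \]
    \[
    \mathrm{Mult}(n_1,\dots,n_T \mid \theta, N)
      = \frac{N!}{\prod_{t=1}^{T} n_t!} \prod_{t=1}^{T} \theta_t^{n_t},
      \qquad N = \sum_{t} n_t
    \]
    \[
    \text{Conjugacy: }\;
    p(\theta \mid n_1,\dots,n_T, \alpha)
      = \mathrm{Dir}(\theta \mid \alpha_1 + n_1, \dots, \alpha_T + n_T)
    \]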
Probabilistic LSI: Graphical Model
[Plate diagram: d -> z -> w. Each document d has its own distribution over topics; the topic z is a latent variable; a word w is generated from that topic. The inner plate repeats Nd times per document, the outer plate D times over documents.]
Probabilistic LSI: Parameter Estimation
- The log-likelihood of probabilistic LSI
- EM algorithm
  - E-step
  - M-step
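The quantities listed above have the following standard form (reconstructed from Hofmann (1999), since the slide's equations are not in the text); n(d, w) denotes the count of word w in document d:

    \[
    \mathcal{L} = \sum_{d}\sum_{w} n(d,w)\,\log P(d,w),
    \qquad
    P(d,w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)
    \]
    E-step:
    \[
    P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}
    \]
    M-step:
    \[
    P(w \mid z) \propto \sum_{d} n(d,w)\, P(z \mid d, w),
    \qquad
    P(z \mid d) \propto \sum_{w} n(d,w)\, P(z \mid d, w)
    \]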
LDA: Graphical Model
[Plate diagram: α → θ → z → w ← φ ← β. For each document, sample a distribution over topics θ; for each word, sample a topic z from θ, then sample a word w from that topic's distribution φ(z). Plates: Nd words per document, D documents, T topics.]
Latent Dirichlet Allocation
A variant of LDA developed by Griffiths (2003); a generative sketch in Python follows the list:
- choose N | ξ ~ Poisson(ξ)
- sample θ | α ~ Dir(α)
- sample φ | β ~ Dir(β)
- sample z | θ ~ Multinomial(θ)
- sample w | z, φ(z) ~ Multinomial(φ(z))
Model inference:
- all Dirichlet priors are assumed to be symmetric
- instead of variational inference and empirical Bayes parameter estimation, Gibbs sampling is adopted
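A minimal sketch of this generative process in Python/NumPy; the dimensions and hyperparameter values below are illustrative, not taken from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    T, V = 4, 50                        # number of topics and vocabulary size (illustrative)
    alpha, beta, xi = 0.5, 0.1, 20.0    # symmetric Dirichlet priors and Poisson rate (illustrative)

    # sample phi | beta ~ Dir(beta): one word distribution per topic
    phi = rng.dirichlet(np.full(V, beta), size=T)

    def generate_document():
        N = rng.poisson(xi)                        # choose N | xi ~ Poisson(xi)
        theta = rng.dirichlet(np.full(T, alpha))   # sample theta | alpha ~ Dir(alpha)
        words = []
        for _ in range(N):
            z = rng.choice(T, p=theta)             # sample z | theta ~ Multinomial(theta)
            w = rng.choice(V, p=phi[z])            # sample w | z, phi(z) ~ Multinomial(phi(z))
            words.append(w)
        return words

    doc = generate_document()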
The Composite Model
An intuitive representation [diagram]: a document-level topic distribution θ generates topic variables z1, z2, z3, z4, while a Markov chain of states s1 → s2 → s3 → s4 determines how the words w1, w2, w3, w4 are generated.
- Semantic state: generates words from LDA
- Syntactic states: generate words from the HMM
Composite Model: Graphical Model
[Plate diagram: θ generates topics z; a Markov chain over classes c, with transition distributions π drawn with prior δ, selects for each word w between the semantic word distributions φ(z) (prior β) and the syntactic class distributions φ(c) (prior γ). Plates: Nd words per document, M documents, T topics, C classes.]
Composite Model
All Dirichlet priors are assumed to be symmetric (a generative sketch in Python follows the list):
- choose N | ξ ~ Poisson(ξ)
- sample θ(d) | α ~ Dir(α)
- sample φ(zi) | β ~ Dir(β)
- sample φ(ci) | γ ~ Dir(γ)
- sample π(ci-1) | δ ~ Dir(δ)
- sample zi | θ(d) ~ Multinomial(θ(d))
- sample ci | π(ci-1) ~ Multinomial(π(ci-1))
- sample wi | zi, φ(zi) ~ Multinomial(φ(zi)) if ci = 1
- sample wi | ci, φ(ci) ~ Multinomial(φ(ci)) otherwise
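A minimal sketch of the composite generative process in Python/NumPy; the dimensions and hyperparameter values are illustrative, and class 0 stands in for the slide's semantic class ci = 1:

    import numpy as np

    rng = np.random.default_rng(1)
    T, C, V = 4, 3, 50                  # topics, syntactic classes, vocabulary size (illustrative)
    alpha, beta, gamma, delta, xi = 0.5, 0.1, 0.1, 0.5, 20.0   # illustrative hyperparameters

    phi_z = rng.dirichlet(np.full(V, beta), size=T)    # semantic (topic) word distributions
    phi_c = rng.dirichlet(np.full(V, gamma), size=C)   # syntactic (class) word distributions
    pi = rng.dirichlet(np.full(C, delta), size=C)      # class transition distributions pi(c)

    def generate_document():
        N = rng.poisson(xi)
        theta = rng.dirichlet(np.full(T, alpha))       # document-specific topic distribution theta(d)
        words, c = [], 0                               # start the class chain in class 0
        for _ in range(N):
            z = rng.choice(T, p=theta)                 # topic for this position
            c = rng.choice(C, p=pi[c])                 # next class from the HMM transition row
            if c == 0:                                 # class 0 plays the role of the semantic class
                w = rng.choice(V, p=phi_z[z])          # word generated by the LDA component
            else:
                w = rng.choice(V, p=phi_c[c])          # word generated by the HMM class component
            words.append(w)
        return words

    doc = generate_document()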
The Composite Model: Generative Process
[Figure illustrating the generative process; not reproduced in this text.]
Bayesian Inference
The EM algorithm can be applied to the composite model:
- treating θ, φ(z), φ(c), π(c) as parameters
- with log P(w | θ, φ(z), φ(c), π(c)) as the likelihood
- but there are too many parameters and convergence is too slow
- the Dirichlet priors are necessary assumptions!
Markov chain Monte Carlo (MCMC):
- instead of explicitly representing θ, φ(z), φ(c), π(c), we consider the posterior distribution over the assignments of words to topics or classes, P(z | w) and P(c | w)
Markov Chain Monte Carlo
Sampling from the posterior distribution according to a Markov chain:
- an ergodic (irreducible and aperiodic) Markov chain converges to a unique equilibrium distribution p(x)
- the idea is to sample the parameters according to a Markov chain whose equilibrium distribution p(x) is exactly the posterior distribution
The key task is to construct a suitable transition kernel T(x, x').
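The condition the transition kernel must satisfy is the standard invariance condition (written here in the slide's notation):

    \[
    p(x') = \sum_{x} p(x)\, T(x, x') \quad \text{for all } x' .
    \]
    A convenient sufficient condition is detailed balance (reversibility):
    \[
    p(x)\, T(x, x') = p(x')\, T(x', x) \quad \text{for all } x, x' .
    \]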
Metropolis-Hastings Algorithm
Sampling by constructing a reversible Markov chain:
- a reversible Markov chain guarantees that p(x) is the equilibrium distribution
- the simultaneous-updating Metropolis-Hastings algorithm follows a similar idea to rejection sampling
Metropolis-Hastings Algorithm (cont.)

    loop
        sample x' from the proposal q(x' | x(t))
        a = min{ 1, [p(x') q(x(t) | x')] / [p(x(t)) q(x' | x(t))] }
        r ~ U(0, 1)
        if r > a:   reject, x(t+1) = x(t)
        else:       accept, x(t+1) = x'
    end

Metropolis-Hastings intuition [figure]: with a symmetric proposal, a move from x(t) to a point x* with higher density is accepted with probability r = 1.0, while a move to a point with lower density is accepted with probability r = p(x*)/p(x(t)).
Metropolis-Hastings Algorithm (cont.)
- Why it works
- Single-site updating algorithm
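The usual argument behind "why it works" (a standard derivation, reconstructed here since the slide's equations are not in the text): the Metropolis-Hastings kernel satisfies detailed balance with respect to p(x), so p(x) is its equilibrium distribution.

    For x ≠ x', the transition kernel is T(x, x') = q(x' | x) a(x, x') with
    \[
    a(x, x') = \min\left\{ 1,\; \frac{p(x')\, q(x \mid x')}{p(x)\, q(x' \mid x)} \right\},
    \]
    so
    \[
    p(x)\, T(x, x') = \min\{\, p(x)\, q(x' \mid x),\; p(x')\, q(x \mid x') \,\} = p(x')\, T(x', x),
    \]
    and summing over x gives \(\sum_x p(x)\, T(x, x') = p(x')\).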
Gibbs Sampling
A special case of the single-site updating Metropolis-Hastings algorithm.
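Why the acceptance probability is always 1 in this special case (a standard derivation, not copied from the slide): the proposal for the i-th component is its full conditional, with the other components held fixed.

    With x' = (x_i', x_{-i}) and proposal q(x' | x) = p(x_i' | x_{-i}),
    \[
    a(x, x')
      = \min\left\{ 1,\;
          \frac{p(x')\, p(x_i \mid x_{-i})}{p(x)\, p(x_i' \mid x_{-i})} \right\}
      = \min\left\{ 1,\;
          \frac{p(x_i' \mid x_{-i})\, p(x_{-i})\, p(x_i \mid x_{-i})}
               {p(x_i \mid x_{-i})\, p(x_{-i})\, p(x_i' \mid x_{-i})} \right\}
      = 1 .
    \]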
Gibbs Sampling for the Composite Model
θ, φ, and π are all integrated out of the corresponding terms; the hyperparameters are sampled with a single-site Metropolis-Hastings algorithm.
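For the LDA part of the model, the collapsed update takes the familiar form below (a standard result, shown as a sketch; the composite model's full conditionals additionally involve the class assignments c and the HMM transition counts):

    \[
    P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
      \frac{n^{(w_i)}_{-i,t} + \beta}{n^{(\cdot)}_{-i,t} + W\beta} \cdot
      \frac{n^{(d_i)}_{-i,t} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha},
    \]
    where n^{(w_i)}_{-i,t} counts how often word w_i is assigned to topic t, n^{(d_i)}_{-i,t} counts the
    topic-t assignments in document d_i, and both counts exclude position i.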
Experiments
Corpora:
- Brown corpus: 500 documents, 1,137,466 word tokens
- TASA corpus: 37,651 documents, 12,190,931 word tokens
- NIPS corpus: 1,713 documents, 4,312,614 word tokens
- W = 37,202 (Brown + TASA); W = 17,268 (NIPS)
Experimental design:
- one class reserved for sentence start/end markers {., ?, !}
- T = 200 and C = 20 (composite); C = 2 (LDA); T = 1 (HMM)
- 4,000 iterations, with 2,000 burn-in and a lag of 100
- 1st-, 2nd-, and 3rd-order Markov chains are considered
Identifying function and content words
[Figure not reproduced.]
Comparative study on the NIPS corpus (T = 100 & C = 50)
[Figure not reproduced.]
Identifying function and content words (NIPS)
[Figure not reproduced.]
Marginal Probabilities
Bayesian model comparison:
- P(w | M) is calculated using the harmonic mean of the likelihoods over the 2,000 retained iterations
- used to evaluate the Bayes factors
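The harmonic mean estimator referred to here has the standard form (reconstructed, since the slide's formula is not in the text):

    \[
    P(\mathbf{w} \mid M) \;\approx\;
      \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{P(\mathbf{w} \mid z^{(s)}, M)} \right)^{-1},
    \]
    where the z^{(s)} are the S assignment samples retained by the Gibbs sampler; the Bayes factor
    comparing models M1 and M2 is then P(w | M1) / P(w | M2).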
Part-of-Speech Tagging
Performance assessed on the Brown corpus:
- one tag set consisted of all Brown tags (297)
- the other collapsed the Brown tags into 10 designations
- the 20th sample was used, evaluated with the Adjusted Rand Index
- compared with DC on the 1,000 most frequent words with 19 clusters
Document Classification
Evaluated with a naive Bayes classifier:
- the 500 Brown documents are classified into 15 groups
- the topic vectors produced by LDA and the composite model are used to train the naive Bayes classifier
- 10-fold cross-validation is used to evaluate the 20th sample
Results (baseline accuracy: 0.09):
- trained on Brown: LDA 0.51; 1st-order composite model 0.45
- Brown + TASA: LDA 0.54; 1st-order composite model 0.45
- Explanation: only about 20% of the words are allocated to the semantic component, too few to find correlations!
Summary
- Bayesian hierarchical models are natural for text modeling
- Simultaneously learning syntactic classes and semantic topics is possible by combining basic modules
- Discovering syntactic and semantic building blocks forms the basis of more sophisticated representations
- Similar ideas could be generalized to other areas
Discussions
- Gibbs sampling vs. the EM algorithm?
- Hierarchical models reduce the number of parameters, but what about model complexity?
- Equal priors for Bayesian model comparison?
- Is there really any effect of the four hyperparameters?
- Probabilistic LSI makes no normal-distribution assumption, while probabilistic PCA assumes normality!
- EM is sensitive to local maxima; why does the Bayesian approach get through?
- Is the document classification experiment a good evaluation?
- Majority vote for tagging?