Lecture #13: Gibbs Sampling for LDA

CS 679: Text Mining
Lecture #13: Gibbs Sampling for LDA
Credit: Many slides are from presentations by Tom Griffiths of Berkeley.
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.

Announcements
Required reading for today: Griffiths & Steyvers, "Finding Scientific Topics"
Final Project Proposal:
  Clear and detailed: ideally, the first half of your project report!
  Talk to me about ideas
  Teams are an option
  Due date to be specified

Objectives
Gain further understanding of LDA
Understand the intractability of exact inference in the model
Gain further insight into Gibbs sampling
Understand how to estimate the parameters of interest in LDA using a collapsed Gibbs sampler

Latent Dirichlet Allocation (slightly different symbols this time) (Blei, Ng, & Jordan, 2001; 2003)
$\alpha, \beta$: Dirichlet priors
$\theta^{(d)}$: distribution over topics for each document, $\theta^{(d)} \sim \mathrm{Dirichlet}(\alpha)$
$z_i$: topic assignment for each word, $z_i \sim \mathrm{Categorical}(\theta^{(d)})$
$\phi^{(j)}$: distribution over words for each topic, $\phi^{(j)} \sim \mathrm{Dirichlet}(\beta)$
$w_i$: word generated from its assigned topic, $w_i \sim \mathrm{Categorical}(\phi^{(z_i)})$
Plates: $N_d$ words per document, $D$ documents, $T$ topics
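For concreteness, here is a minimal sketch of the generative process above using NumPy. The toy sizes, hyperparameter values, and variable names are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

T, W, D = 2, 5, 3          # topics, vocabulary size, documents (toy sizes)
alpha, beta = 0.5, 0.1     # symmetric Dirichlet hyperparameters (assumed values)
N_d = [10, 12, 8]          # words per document

# One distribution over words per topic, one distribution over topics per document
phi = rng.dirichlet(np.full(W, beta), size=T)      # phi[j]   ~ Dirichlet(beta)
theta = rng.dirichlet(np.full(T, alpha), size=D)   # theta[d] ~ Dirichlet(alpha)

docs, topics = [], []
for d in range(D):
    z = rng.choice(T, size=N_d[d], p=theta[d])             # z_i ~ Categorical(theta[d])
    w = np.array([rng.choice(W, p=phi[j]) for j in z])     # w_i ~ Categorical(phi[z_i])
    topics.append(z)
    docs.append(w)

print(docs[0], topics[0])
```

Running the process forward like this is easy; the lecture is about going the other way, from $\vec{w}$ back to $\vec{z}$, $\Phi$, and $\Theta$.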

The Statistical Problem of Meaning
Generating data from parameters is easy
Learning parameters from data is hard
What does it mean to identify the "meaning" of a document?

Estimation of the LDA Generative Model
Maximum likelihood estimation (EM)
  Similar to the method presented by Hofmann for pLSI (1999)
Deterministic approximate algorithms
  Variational EM (Blei, Ng & Jordan, 2001, 2003)
  Expectation propagation (Minka & Lafferty, 2002)
Markov chain Monte Carlo – our focus
  Full Gibbs sampler (Pritchard et al., 2000)
  Collapsed Gibbs sampler (Griffiths & Steyvers, 2004) – the papers you read for today

Estimation of the Generative Model
Maximum likelihood estimation (EM): WT + DT parameters
Variational EM (Blei, Ng & Jordan, 2002): WT + T parameters
Bayesian inference (collapsed): 0 parameters

Review: Markov Chain Monte Carlo (MCMC)
Samples from a Markov chain converge to a target distribution
Allows sampling from an unnormalized posterior distribution
Can compute approximate statistics from intractable distributions (MacKay, 2002)

Review: Gibbs Sampling
The most straightforward kind of MCMC
For variables $x_1, x_2, \ldots, x_n$, require the full (or "complete") conditional distribution for each variable:
Draw $x_i^{(t)}$ from $P(x_i \mid x_{-i}) = P(x_i \mid \mathrm{MB}(x_i))$
where $x_{-i} = \left(x_1^{(t)}, x_2^{(t)}, \ldots, x_{i-1}^{(t)}, x_{i+1}^{(t-1)}, \ldots, x_n^{(t-1)}\right)$
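To make the scan structure concrete, here is a minimal generic Gibbs sampler skeleton (not from the slides); `sample_full_conditional` is a hypothetical placeholder the caller must supply for drawing $x_i$ given all other variables.

```python
import numpy as np

def gibbs_sampler(x_init, sample_full_conditional, num_iters, rng):
    """Generic Gibbs sampler sketch.

    x_init: initial state (1-D array of variables x_1..x_n)
    sample_full_conditional(i, x, rng): placeholder callback returning a draw
        of x_i from P(x_i | x_-i); it defines the model being sampled.
    """
    x = np.array(x_init, dtype=float)
    samples = []
    for t in range(num_iters):
        for i in range(len(x)):
            # x currently holds the newest values for indices < i (iteration t)
            # and the previous values for indices > i (iteration t-1)
            x[i] = sample_full_conditional(i, x, rng)
        samples.append(x.copy())
    return np.array(samples)
```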

Bayesian Inference in LDA
We would like to reason with the full joint distribution:
$P(\vec{w}, \vec{z}, \Phi, \Theta \mid \alpha, \beta) = P(\vec{w}, \vec{z} \mid \Phi, \Theta)\, P(\Phi \mid \beta)\, P(\Theta \mid \alpha)$
Given $\vec{w}$, the distribution over the latent variables is what we want, but the denominator (the marginal likelihood) is intractable to compute:
$P(\vec{z}, \Phi, \Theta \mid \vec{w}, \alpha, \beta) = \dfrac{P(\vec{w}, \vec{z}, \Phi, \Theta \mid \alpha, \beta)}{P(\vec{w} \mid \alpha, \beta)}$
We marginalize the model parameters out of the joint distribution so that we can focus on the words in the corpus ($\vec{w}$) and their assigned topics ($\vec{z}$):
$P(\vec{w}, \vec{z} \mid \alpha, \beta) = \int_\Phi \int_\Theta P(\vec{w}, \vec{z} \mid \Phi, \Theta)\, P(\Phi \mid \beta)\, P(\Theta \mid \alpha)\, d\Theta\, d\Phi$
This leads to our use of the term "collapsed sampler"

Posterior Inference in LDA
From this marginalized joint distribution, we can compute the posterior distribution over topics for a given corpus ($\vec{w}$):
$P(\vec{z} \mid \vec{w}, \alpha, \beta) = \dfrac{P(\vec{w}, \vec{z} \mid \alpha, \beta)}{P(\vec{w} \mid \alpha, \beta)} = \dfrac{P(\vec{w}, \vec{z} \mid \alpha, \beta)}{\sum_{\vec{z}} P(\vec{w}, \vec{z} \mid \alpha, \beta)}$
But there are $T^n$ possible topic assignments $\vec{z}$, where $n$ is the number of tokens in the corpus, so inference is still intractable!
Working with this topic posterior is only tractable up to a constant multiple:
$P(\vec{z} \mid \vec{w}, \alpha, \beta) \propto P(\vec{w}, \vec{z} \mid \alpha, \beta)$

Collapsed Gibbs Sampler for LDA
Since we're now focusing on the topic posterior, namely:
$P(\vec{z} \mid \vec{w}, \alpha, \beta) \propto P(\vec{w}, \vec{z} \mid \alpha, \beta) = P(\vec{w} \mid \vec{z}, \alpha, \beta)\, P(\vec{z} \mid \alpha, \beta)$
let's find these two factors by marginalizing separately.
This works out well due to the conjugacy of the Dirichlet and multinomial (/ categorical) distributions.
Where:
  $n_j^{(w)}$ is the number of times word $w$ is assigned to topic $j$
  $n_j^{(d)}$ is the number of times topic $j$ is used in document $d$
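The factor equations themselves appeared as images on the original slide; for reference, the standard forms from Griffiths & Steyvers (2004), written in the count notation defined above, are:

$$P(\vec{w} \mid \vec{z}, \beta) = \left(\frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}}\right)^{T} \prod_{j=1}^{T} \frac{\prod_{w} \Gamma\!\left(n_j^{(w)} + \beta\right)}{\Gamma\!\left(n_j^{(\cdot)} + W\beta\right)}$$

$$P(\vec{z} \mid \alpha) = \left(\frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}}\right)^{D} \prod_{d=1}^{D} \frac{\prod_{j} \Gamma\!\left(n_j^{(d)} + \alpha\right)}{\Gamma\!\left(n_{\cdot}^{(d)} + T\alpha\right)}$$

Here $n_j^{(\cdot)} = \sum_w n_j^{(w)}$ is the total number of tokens assigned to topic $j$, and $n_{\cdot}^{(d)} = \sum_j n_j^{(d)}$ is the length of document $d$.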

Collapsed Gibbs Sampler for LDA
We only sample each $z_i$!
Complete (or full) conditionals can now be derived for each $z_i$ in $\vec{z}$.
See (Heinrich, 2008) for details of the math -- linked on the schedule
Where:
  $d_i$ is the document in which word $w_i$ occurs
  $n_{-i,j}^{(w)}$ is the number of times (ignoring position $i$) word $w$ is assigned to topic $j$
  $n_{-i,j}^{(d)}$ is the number of times (ignoring position $i$) topic $j$ is used in document $d$
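The full conditional itself was an image on the slide; in the notation above, the familiar form from Griffiths & Steyvers (2004) is:

$$P(z_i = j \mid \vec{z}_{-i}, \vec{w}) \;\propto\; \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}$$

The first factor favors topics that already like word $w_i$; the second favors topics already prominent in document $d_i$.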

Steps for deriving the complete conditionals
1. Begin with the full joint distribution over the data, latent variables, and model parameters, given the fixed parameters $\alpha$ and $\beta$ of the prior distributions.
2. Write out the desired collapsed joint distribution and set it equal to the appropriate integral over the full joint in order to marginalize over $\Phi$ and $\Theta$.
3. Perform algebra and group like terms.
4. Expand the generic notation by applying the closed-form definitions of the Multinomial, Categorical, and Dirichlet distributions.
5. Transform the representation: change the product indices from products over documents and word sequences to products over topic (cluster) labels and token counts.
6. Simplify by combining products, adding exponents, and pulling constant multipliers outside of integrals.
7. When you have integrals over terms that are in the form of the kernel of a Dirichlet distribution, consider how to convert the result into a familiar distribution (see the identity below).
8. Once you have the expression for the collapsed joint, derive the expression for the conditional distribution.
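Step 7 relies on recognizing the Dirichlet kernel. The identity in question (a standard fact, not shown on the slide) is that for counts $n_1, \ldots, n_K$ and a symmetric Dirichlet parameter $\gamma$:

$$\int \prod_{k=1}^{K} \theta_k^{\,n_k}\; \mathrm{Dirichlet}(\theta \mid \gamma)\, d\theta = \frac{\Gamma(K\gamma)}{\Gamma(\gamma)^{K}} \cdot \frac{\prod_{k=1}^{K} \Gamma(n_k + \gamma)}{\Gamma\!\left(\sum_{k=1}^{K} n_k + K\gamma\right)}$$

Applying this once per topic (with $K = W$, $\gamma = \beta$) and once per document (with $K = T$, $\gamma = \alpha$) yields the two marginalized factors shown earlier.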

Collapsed Gibbs Sampler for LDA
For $t = 1$ to $burn + length$:
  For variables $\vec{z} = z_1, z_2, \ldots, z_n$ (i.e., for $i = 1$ to $n$):
    Draw $z_i^{(t)}$ from $P(z_i \mid \vec{z}_{-i}, \vec{w})$
where $\vec{z}_{-i} = \left(z_1^{(t)}, z_2^{(t)}, \ldots, z_{i-1}^{(t)}, z_{i+1}^{(t-1)}, \ldots, z_n^{(t-1)}\right)$
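A minimal sketch of this sampler in Python, assuming the count-based full conditional given above; array names such as `n_wt`, `n_dt`, and `n_t` are illustrative choices, not from the slides.

```python
import numpy as np

def collapsed_gibbs_lda(docs, T, W, alpha, beta, num_iters, seed=0):
    """Sketch of a collapsed Gibbs sampler for LDA.

    docs: list of documents, each an iterable of word ids in [0, W)
    Returns the final topic assignments z (one array per document).
    """
    rng = np.random.default_rng(seed)

    n_wt = np.zeros((W, T))          # n_j^(w): word-topic counts
    n_dt = np.zeros((len(docs), T))  # n_j^(d): document-topic counts
    n_t = np.zeros(T)                # n_j^(.): total tokens per topic

    # Random initialization of topic assignments, with counts to match
    z = []
    for d, doc in enumerate(docs):
        zd = rng.integers(T, size=len(doc))
        z.append(zd)
        for w, j in zip(doc, zd):
            n_wt[w, j] += 1
            n_dt[d, j] += 1
            n_t[j] += 1

    for _ in range(num_iters):          # burn + length sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Remove token i from the counts ("ignoring position i")
                n_wt[w, j] -= 1; n_dt[d, j] -= 1; n_t[j] -= 1

                # Full conditional P(z_i = j | z_-i, w); the per-document
                # denominator (n^(d) + T*alpha) is constant in j, so it drops out
                p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
                p /= p.sum()

                # Draw the new assignment and restore the counts
                j = rng.choice(T, p=p)
                z[d][i] = j
                n_wt[w, j] += 1; n_dt[d, j] += 1; n_t[j] += 1
    return z
```

After burn-in, samples of $\vec{z}$ taken at a lag can be used to recover estimates of $\Phi$ and $\Theta$, as described on the next slide.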

Collapsed Gibbs Sampler for LDA
This is nicer than your average Gibbs sampler:
  Memory: the counts (the $n_{\cdot}^{(\cdot)}$ counts) can be cached in two sparse matrices
  No special functions, simple arithmetic
  The distributions on $\Phi$ and $\Theta$ are analytic given the topic assignments $\vec{z}$ and words $\vec{w}$, and can later be recomputed from the samples in a given iteration of the sampler:
    $\Phi$ from $\vec{w} \mid \vec{z}$
    $\Theta$ from $\vec{z}$
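Concretely, the usual point estimates recomputed from a sample of $\vec{z}$ (as in Griffiths & Steyvers, 2004) are:

$$\hat{\phi}_j^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + W\beta}, \qquad \hat{\theta}_j^{(d)} = \frac{n_j^{(d)} + \alpha}{n_{\cdot}^{(d)} + T\alpha}$$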

Gibbs sampling in LDA: a worked toy example with T = 2 topics, Nd = 10 words per document, and M = 5 documents.
[A sequence of figure slides shows the table of word tokens and their topic assignments after iteration 1, after iteration 2, and so on through iteration 1000.]

A Visual Example: Bars
Sample each pixel from a mixture of topics
pixel = word, image = document
A toy problem; just a metaphor for inference on text.

Documents generated from the topics.

Evolution of the topics (𝜙 matrix)

Interpretable Decomposition
SVD gives a basis for the data, but not an interpretable one
The true basis is not orthogonal, so rotation does no good

Effects of Hyperparameters
$\alpha$ and $\beta$ control the relative sparsity of $\Theta$ and $\Phi$:
  smaller $\alpha$: fewer topics per document
  smaller $\beta$: fewer words per topic
Good assignments $\vec{z}$ are a compromise in sparsity

Bayesian Model Selection
How many topics do we need? A Bayesian would consider the posterior:
$P(T \mid \vec{w}) \propto P(\vec{w} \mid T)\, P(T)$
Computing $P(\vec{w} \mid T)$ involves summing over assignments $\vec{z}$
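Written out (not shown on the slide), the quantity that must be approximated is:

$$P(\vec{w} \mid T) = \sum_{\vec{z}} P(\vec{w} \mid \vec{z}, T)\, P(\vec{z} \mid T)$$

In Griffiths & Steyvers (2004), this is approximated from the Gibbs samples, using a harmonic-mean estimator of $P(\vec{w} \mid \vec{z}, T)$ over sampled assignments $\vec{z}$.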

Bayesian model selection
[A sequence of figure slides illustrates evaluating $P(\vec{w} \mid T)$ for the corpus $\vec{w}$ with a model of T = 100 topics.]

Sweeping T

Analysis of PNAS Abstracts
Used all D = 28,154 abstracts from 1991-2001
Used any word occurring in at least five abstracts and not on the "stop" list (W = 20,551)
Segmentation by any delimiting character, for a total of n = 3,026,970 word tokens in the corpus
Also used the PNAS class designations for 2001 (Acknowledgment: Kevin Boyack)

Running the Algorithm
Memory requirements are linear in T(W+D); runtime is proportional to nT
T = 50, 100, 200, 300, 400, 500, 600, (1000)
Ran 8 chains for each T, with a burn-in of 1000 iterations and 10 samples per chain at a lag of 100
All runs completed in under 30 hours on the Blue Horizon supercomputer at San Diego

How many topics?

Topics by Document Length
Not sure what this means

A Selection of Topics
One topic per column on the original slide; words within each topic are sorted by decreasing P(w | z):
  FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
  HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
  MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
  STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE
  NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
  TUMOR CANCER TUMORS CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

Cold topics Hot topics

Cold and Hot Topics
Cold topics (declining over time):
  Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
  Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
  Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Hot topics (rising over time):
  Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
  Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
  Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION

Conclusions
Estimation/inference in LDA is more or less straightforward using Gibbs sampling (i.e., easy!)
It is not so easy in all graphical models

Coming Soon
Topical n-grams