Lecture #13: Gibbs Sampling for LDA


1 Lecture #13: Gibbs Sampling for LDA
CS 679: Text Mining. Lecture #13: Gibbs Sampling for LDA. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. Credit: many slides are from presentations by Tom Griffiths of Berkeley.

2 Announcements
Required reading for today: Griffiths & Steyvers, "Finding Scientific Topics."
Final Project Proposal: clear and detailed; ideally, the first half of your project report! Talk to me about ideas. Teams are an option. Due date to be specified.

3 Objectives
Gain further understanding of LDA.
Understand the intractability of inference with the model.
Gain further insight into Gibbs sampling.
Understand how to estimate the parameters of interest in LDA using a collapsed Gibbs sampler.

4 Latent Dirichlet Allocation (slightly different symbols this time)
(Blei, Ng, & Jordan, 2001; 2003)
Dirichlet priors: α on the per-document topic distributions, β on the per-topic word distributions.
Distribution over topics for each document: θ(d) ~ Dirichlet(α).
Topic assignment for each word: z_i ~ Categorical(θ(d)).
Distribution over words for each topic: φ(j) ~ Dirichlet(β), for j = 1, ..., T.
Word generated from its assigned topic: w_i ~ Categorical(φ(z_i)).
(Plate diagram: N_d words per document, D documents, T topics.)
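To make this generative story concrete, here is a minimal sketch of it in Python; the function name and parameter defaults are illustrative (not from the slides), and symmetric Dirichlet priors are assumed:

import numpy as np

def generate_corpus(D, T, W, N_d, alpha=0.1, beta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process."""
    rng = np.random.default_rng(seed)
    # One word distribution phi^(j) ~ Dirichlet(beta) per topic j.
    phi = rng.dirichlet(np.full(W, beta), size=T)
    docs = []
    for d in range(D):
        # Topic distribution theta^(d) ~ Dirichlet(alpha) for this document.
        theta = rng.dirichlet(np.full(T, alpha))
        words = []
        for _ in range(N_d):
            z = rng.choice(T, p=theta)   # z_i ~ Categorical(theta^(d))
            w = rng.choice(W, p=phi[z])  # w_i ~ Categorical(phi^(z_i))
            words.append(w)
        docs.append(words)
    return docs, phi

docs, phi = generate_corpus(D=5, T=2, W=10, N_d=20)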

5 The Statistical Problem of Meaning
Generating data from parameters is easy; learning parameters from data is hard. What does it mean to identify the "meaning" of a document?

6 Estimation of the LDA Generative Model
Maximum likelihood estimation (EM): similar to the method presented by Hofmann for pLSI (1999).
Deterministic approximate algorithms: variational EM (Blei, Ng & Jordan, 2001, 2003); expectation propagation (Minka & Lafferty, 2002).
Markov chain Monte Carlo (our focus), the papers you read for today: the full Gibbs sampler (Pritchard et al., 2000); the collapsed Gibbs sampler (Griffiths & Steyvers, 2004).

7 Estimation of the Generative Model
Maximum likelihood estimation (EM): WT + DT parameters.
Variational EM (Blei, Ng & Jordan, 2001, 2003): WT + T parameters.
Bayesian inference (collapsed): 0 parameters.

8 Review: Markov Chain Monte Carlo (MCMC)
Draw samples from a Markov chain whose stationary distribution is the target distribution.
Allows sampling from an unnormalized posterior distribution.
Can compute approximate statistics from otherwise intractable distributions. (MacKay, 2002)

9 Review: Gibbs Sampling
Most straightforward kind of MCMC. For variables x_1, x_2, ..., x_n, we require the full (or "complete") conditional distribution for each variable:
Draw x_i^(t) from P(x_i | x_-i) = P(x_i | MB(x_i)), where MB(x_i) is the Markov blanket of x_i and
x_-i = x_1^(t), x_2^(t), ..., x_{i-1}^(t), x_{i+1}^(t-1), ..., x_n^(t-1)
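As a quick concrete illustration of this recipe (a standard textbook example, unrelated to LDA; names and values are illustrative), here is a Gibbs sampler for a bivariate Gaussian with correlation rho, where both full conditionals are one-dimensional Gaussians:

import numpy as np

def gibbs_bivariate_normal(rho, iters=5000, seed=0):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]])."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho ** 2)        # conditional standard deviation
    samples = []
    for t in range(iters):
        # Draw each variable from its full conditional given the other,
        # always using the most recent value of the other variable.
        x1 = rng.normal(rho * x2, sd)   # x1 | x2 ~ N(rho * x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, sd)   # x2 | x1 ~ N(rho * x1, 1 - rho^2)
        samples.append((x1, x2))
    return np.array(samples)

samples = gibbs_bivariate_normal(rho=0.8)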

10 Bayesian Inference in LDA
We would like to reason with the full joint distribution:
P(\vec{w}, \vec{z}, \Phi, \Theta \mid \alpha, \beta) = P(\vec{w}, \vec{z} \mid \Phi, \Theta)\, P(\Phi \mid \beta)\, P(\Theta \mid \alpha)
Given the corpus w, the distribution over the latent variables is what we want, but the denominator (the marginal likelihood) is intractable to compute:
P(\vec{z}, \Phi, \Theta \mid \vec{w}, \alpha, \beta) = \frac{P(\vec{w}, \vec{z}, \Phi, \Theta \mid \alpha, \beta)}{P(\vec{w} \mid \alpha, \beta)}
We marginalize the model parameters out of the joint distribution so that we can focus on the words in the corpus (w) and their assigned topics (z):
P(\vec{w}, \vec{z} \mid \alpha, \beta) = \int_{\Phi} \int_{\Theta} P(\vec{w}, \vec{z} \mid \Phi, \Theta)\, P(\Phi \mid \beta)\, P(\Theta \mid \alpha)\; d\Theta\, d\Phi
This leads to our use of the term "collapsed sampler."
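The integrand above factorizes over the n word tokens in the corpus, which is what makes the marginalization tractable. In the notation of the model slide earlier (with d_i denoting the document containing token i, and \phi^{(j)}_{w} the probability of word w under topic j), this standard fact reads:
P(\vec{w}, \vec{z} \mid \Phi, \Theta) = \prod_{i=1}^{n} P(w_i \mid z_i, \Phi)\, P(z_i \mid \theta^{(d_i)}) = \prod_{i=1}^{n} \phi^{(z_i)}_{w_i}\, \theta^{(d_i)}_{z_i}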

11 Posterior Inference in LDA
From this marginalized joint distribution, we can compute the posterior distribution over topics for a given corpus (w):
P(\vec{z} \mid \vec{w}, \alpha, \beta) = \frac{P(\vec{w}, \vec{z} \mid \alpha, \beta)}{P(\vec{w} \mid \alpha, \beta)} = \frac{P(\vec{w}, \vec{z} \mid \alpha, \beta)}{\sum_{\vec{z}} P(\vec{w}, \vec{z} \mid \alpha, \beta)}
But there are T^n possible topic assignments z, where n is the number of tokens in the corpus; i.e., inference is still intractable!
Working with this topic posterior is tractable only up to a constant multiple:
P(\vec{z} \mid \vec{w}, \alpha, \beta) \propto P(\vec{w}, \vec{z} \mid \alpha, \beta)

12 Collapsed Gibbs Sampler for LDA
Since we're now focusing on the topic posterior, namely
P(\vec{z} \mid \vec{w}, \alpha, \beta) \propto P(\vec{w}, \vec{z} \mid \alpha, \beta) = P(\vec{w} \mid \vec{z}, \alpha, \beta)\, P(\vec{z} \mid \alpha, \beta)
let's find these two factors by marginalizing separately. This works out well due to the conjugacy of the Dirichlet and multinomial (/ categorical) distributions.
Where:
n_j^(w) is the number of times word w is assigned to topic j
n_j^(d) is the number of times topic j is used in document d
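For reference, the two factors work out to the closed forms given by Griffiths & Steyvers (2004) for symmetric priors, where W is the vocabulary size, T the number of topics, D the number of documents, and a dot denotes summation over that index:
P(\vec{w} \mid \vec{z}, \alpha, \beta) = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T} \prod_{j=1}^{T} \frac{\prod_{w} \Gamma\!\left( n_j^{(w)} + \beta \right)}{\Gamma\!\left( n_j^{(\cdot)} + W\beta \right)}
P(\vec{z} \mid \alpha, \beta) = \left( \frac{\Gamma(T\alpha)}{\Gamma(\alpha)^{T}} \right)^{D} \prod_{d=1}^{D} \frac{\prod_{j} \Gamma\!\left( n_j^{(d)} + \alpha \right)}{\Gamma\!\left( n_{\cdot}^{(d)} + T\alpha \right)}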

13 Collapsed Gibbs Sampler for LDA
We only sample each z_i! Complete (or full) conditionals can now be derived for each z_i in z. See Heinrich (2008), linked on the schedule, for the details of the math.
Where:
d_i is the document in which word w_i occurs
n_{-i,j}^(w) is the number of times word w is assigned to topic j, ignoring position i
n_{-i,j}^(d) is the number of times topic j is used in document d, ignoring position i
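Written out (in the form given by Griffiths & Steyvers, 2004, with W the vocabulary size and T the number of topics), the complete conditional for z_i is:
P(z_i = j \mid \vec{z}_{-i}, \vec{w}, \alpha, \beta) \propto \frac{n_{-i,j}^{(w_i)} + \beta}{n_{-i,j}^{(\cdot)} + W\beta} \cdot \frac{n_{-i,j}^{(d_i)} + \alpha}{n_{-i,\cdot}^{(d_i)} + T\alpha}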

14 Steps for deriving the complete conditionals
1. Begin with the full joint distribution over the data, latent variables, and model parameters, given the fixed hyperparameters α and β of the prior distributions.
2. Write out the desired collapsed joint distribution and set it equal to the appropriate integral over the full joint, in order to marginalize over Φ and Θ.
3. Perform algebra and group like terms.
4. Expand the generic notation by applying the closed-form definitions of the multinomial, categorical, and Dirichlet distributions.
5. Transform the representation: change the product indices from products over documents and word sequences to products over cluster labels and token counts.
6. Simplify by combining products, adding exponents, and pulling constant multipliers outside of integrals.
7. When you have integrals over terms in the form of the kernel of the Dirichlet distribution, convert them into a familiar closed form (the identity below is the key fact).
8. Once you have the expression for the joint, derive the expression for the conditional distribution.
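The key fact used in step 7 is the normalizing constant of the Dirichlet distribution: an integral of a Dirichlet kernel over the probability simplex has a closed form in Gamma functions,
\int \prod_{w=1}^{W} \phi_{w}^{\,a_w - 1}\; d\phi = \frac{\prod_{w=1}^{W} \Gamma(a_w)}{\Gamma\!\left( \sum_{w=1}^{W} a_w \right)}
so each integral of the form \int \prod_{w} \phi_{j,w}^{\,n_j^{(w)} + \beta - 1}\, d\phi_j evaluates to a ratio of Gamma functions; this is exactly where the Gamma-function factors in the marginalized forms above come from.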

15 Collapsed Gibbs Sampler for LDA
For t = 1 to burn + length:
For variables z = z_1, z_2, ..., z_n (i.e., for i = 1 to n):
Draw z_i^(t) from P(z_i | \vec{z}_{-i}, \vec{w}), where
\vec{z}_{-i} = z_1^(t), z_2^(t), ..., z_{i-1}^(t), z_{i+1}^(t-1), ..., z_n^(t-1)
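A minimal sketch of this sampler in Python (this is not code from the assigned papers; the function name, defaults, and symmetric scalar priors are illustrative, and burn-in/lag bookkeeping is omitted):

import numpy as np

def collapsed_gibbs_lda(docs, T, W, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA (illustrative sketch).

    docs: list of documents, each a list of word ids in [0, W).
    Returns final topic assignments plus the count matrices.
    """
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_wt = np.zeros((W, T))   # n_j^(w): word-topic counts
    n_dt = np.zeros((D, T))   # n_j^(d): document-topic counts
    n_t = np.zeros(T)         # n_j^(.): total tokens per topic
    z = [np.zeros(len(doc), dtype=int) for doc in docs]

    # Initialize topic assignments at random and fill in the counts.
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            j = rng.integers(T)
            z[d][i] = j
            n_wt[w, j] += 1; n_dt[d, j] += 1; n_t[j] += 1

    for sweep in range(iters):            # burn + length sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                j = z[d][i]
                # Remove token i from the counts (the "-i" quantities).
                n_wt[w, j] -= 1; n_dt[d, j] -= 1; n_t[j] -= 1
                # Full conditional up to a constant: the document-length
                # denominator is the same for every topic j, so it drops out.
                p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
                j = rng.choice(T, p=p / p.sum())
                # Add token i back under its newly sampled topic.
                z[d][i] = j
                n_wt[w, j] += 1; n_dt[d, j] += 1; n_t[j] += 1
    return z, n_wt, n_dt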

16 Collapsed Gibbs Sampler for LDA
This is nicer than your average Gibbs sampler:
Memory: the counts (the n_j^(w) and n_j^(d) counts) can be cached in two sparse matrices.
No special functions, just simple arithmetic.
The distributions on Φ and Θ are analytic given the topic assignments z and the words w, and can later be recomputed from the samples of a given iteration of the sampler: Φ from w | z, and Θ from z.
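Concretely, the usual point estimates recovered from the counts of a single sample (as in Griffiths & Steyvers, 2004) are:
\hat{\phi}_{j}^{(w)} = \frac{n_j^{(w)} + \beta}{n_j^{(\cdot)} + W\beta} \qquad \hat{\theta}_{j}^{(d)} = \frac{n_j^{(d)} + \alpha}{n_{\cdot}^{(d)} + T\alpha}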

17–25 Gibbs sampling in LDA
(Figures: a toy worked example with T = 2 and M = 5, showing the state of the sampler over successive iterations, starting at iteration 1.)

26 A Visual Example: Bars
Sample each pixel from a mixture of topics: pixel = word, image = document. A toy problem; just a metaphor for inference on text.

27 Documents generated from the topics.

28 Evolution of the topics (𝜙 matrix)

29 Interpretable decomposition
SVD gives a basis for the data, but not an interpretable one. The true basis is not orthogonal, so rotation does no good.

30 Effects of Hyper-parameters
α and β control the relative sparsity of Θ (topics per document) and Φ (words per topic):
smaller α: fewer topics per document
smaller β: fewer words per topic
Good assignments z are a compromise in sparsity.

31 Effects of Hyper-parameters
α and β control the relative sparsity of Θ (topics per document) and Φ (words per topic): smaller α means fewer topics per document, and smaller β means fewer words per topic. Good assignments z are a compromise in sparsity.
(Figure: an illustrative plot of log Γ(x) against x.)

32 Bayesian model selection
How many topics do we need? A Bayesian would consider the posterior P(T | w) ∝ P(w | T) P(T), which involves summing over the assignments z.
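A practical note on how Griffiths & Steyvers (2004) actually compute this (their approach in the paper, not something spelled out on the slide): the intractable sum over z is approximated from S posterior samples \vec{z}^{(1)}, ..., \vec{z}^{(S)} using the harmonic mean estimator
P(\vec{w} \mid T) \approx \left( \frac{1}{S} \sum_{s=1}^{S} P\!\left( \vec{w} \mid \vec{z}^{(s)} \right)^{-1} \right)^{-1}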

33–35 Bayesian model selection
(Figures: the marginal likelihood P(w | T) of the corpus w, illustrated with T = 100 as an example.)

36 Sweeping T

37 Analysis of PNAS abstracts
Used all D = 28,154 abstracts from PNAS.
Used any word occurring in at least five abstracts and not on the "stop" list (W = 20,551).
Segmentation by any delimiting character, for a total of n = 3,026,970 word tokens in the corpus.
Also used the PNAS class designations for 2001. (Acknowledgment: Kevin Boyack)

38 Running the algorithm
Memory requirements are linear in T(W + D); runtime is proportional to nT.
T = 50, 100, 200, 300, 400, 500, 600, (1000).
Ran 8 chains for each T, with a burn-in of 1000 iterations and 10 samples per chain at a lag of 100.
All runs completed in under 30 hours on the BlueHorizon supercomputer at San Diego.

39 How many topics?

40 Topics by Document Length
Not sure what this means

41 A Selection of Topics
Each column of the slide lists the highest-probability words under P(w | z) for one topic:
FORCE SURFACE MOLECULES SOLUTION SURFACES MICROSCOPY WATER FORCES PARTICLES STRENGTH POLYMER IONIC ATOMIC AQUEOUS MOLECULAR PROPERTIES LIQUID SOLUTIONS BEADS MECHANICAL
HIV VIRUS INFECTED IMMUNODEFICIENCY CD4 INFECTION HUMAN VIRAL TAT GP120 REPLICATION TYPE ENVELOPE AIDS REV BLOOD CCR5 INDIVIDUALS ENV PERIPHERAL
MUSCLE CARDIAC HEART SKELETAL MYOCYTES VENTRICULAR MUSCLES SMOOTH HYPERTROPHY DYSTROPHIN HEARTS CONTRACTION FIBERS FUNCTION TISSUE RAT MYOCARDIAL ISOLATED MYOD FAILURE
STRUCTURE ANGSTROM CRYSTAL RESIDUES STRUCTURES STRUCTURAL RESOLUTION HELIX THREE HELICES DETERMINED RAY CONFORMATION HELICAL HYDROPHOBIC SIDE DIMENSIONAL INTERACTIONS MOLECULE
NEURONS BRAIN CORTEX CORTICAL OLFACTORY NUCLEUS NEURONAL LAYER NUCLEI CEREBELLUM CEREBELLAR LATERAL CEREBRAL LAYERS GRANULE LABELED HIPPOCAMPUS AREAS THALAMIC
TUMOR CANCER TUMORS CELLS BREAST MELANOMA GROWTH CARCINOMA PROSTATE NORMAL CELL METASTATIC MALIGNANT LUNG CANCERS MICE NUDE PRIMARY OVARIAN

42 Cold topics Hot topics

43 Cold topics, Hot topics
Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION

44 Cold topics, Hot topics
Topic 37: CDNA AMINO SEQUENCE ACID PROTEIN ISOLATED ENCODING CLONED ACIDS IDENTITY CLONE EXPRESSED ENCODES RAT HOMOLOGY
Topic 289: KDA PROTEIN PURIFIED MOLECULAR MASS CHROMATOGRAPHY POLYPEPTIDE GEL SDS BAND APPARENT LABELED IDENTIFIED FRACTION DETECTED
Topic 75: ANTIBODY ANTIBODIES MONOCLONAL ANTIGEN IGG MAB SPECIFIC EPITOPE HUMAN MABS RECOGNIZED SERA EPITOPES DIRECTED NEUTRALIZING
Topic 2: SPECIES GLOBAL CLIMATE CO2 WATER ENVIRONMENTAL YEARS MARINE CARBON DIVERSITY OCEAN EXTINCTION TERRESTRIAL COMMUNITY ABUNDANCE
Topic 134: MICE DEFICIENT NORMAL GENE NULL MOUSE TYPE HOMOZYGOUS ROLE KNOCKOUT DEVELOPMENT GENERATED LACKING ANIMALS REDUCED
Topic 179: APOPTOSIS DEATH CELL INDUCED BCL CELLS APOPTOTIC CASPASE FAS SURVIVAL PROGRAMMED MEDIATED INDUCTION CERAMIDE EXPRESSION

45 Conclusions
Estimation/inference in LDA is more or less straightforward using Gibbs sampling (i.e., easy!). It is not so easy in all graphical models.

46 Coming Soon: Topical n-grams

