Topic Models.


Topic Models

Outline
- Review of directed models
  - Independencies, d-separation, and "explaining away"
  - Learning for Bayes nets
- Directed models for text
  - Naïve Bayes models
  - Latent Dirichlet allocation (LDA)

Review of Directed Models (aka Bayes Nets)

Directed Model = Graph + Conditional Probability Distributions

The Graph ⇒ (Some) Pairwise Conditional Independencies

Plate notation lets us denote complex graphs compactly: a plate stands for repeated copies of the nodes inside it.

Directed Models > HMMs

[Figure: an HMM with a chain of states S1…S4 emitting symbols such as a and c, with tables for the initial distribution P(S), the transition distribution P(S'|S), and the emission distribution P(X|S); e.g., from state s, P(a|s)=0.9 and P(c|s)=0.1.]

Speaker notes: In previous models, Pr(a_i) depended only on the symbols appearing within some distance before it, not on the position i itself. To model drifting/evolving sequences we need something more powerful, and hidden Markov models provide one such option. Here states do not correspond to substrings, hence the name "hidden". There are two kinds of probabilities: transitions, as before, but also emissions. Calculating Pr(seq) is not easy, since every symbol can potentially be generated from every state; there is no single path that generates the sequence, but many paths, each with some probability. However, it is easy to calculate the joint probability of a path and the emitted symbols: enumerate all possible paths and sum their probabilities. We can do much better by exploiting the Markov property.

Directed Models > HMMs

[Figure: the same HMM, highlighting the states S2…S4 and the emitted symbols a and c.]

Important point: I can compute Pr(S2 = t | aaca). So inference does not always "follow the arrows".
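To make the "inference does not always follow the arrows" point concrete, here is a small forward-backward sketch that computes Pr(S_n = k | aaca) for every position n. It is a minimal illustration, not the deck's own code; the two-state HMM below loosely follows the slide's tables, but the exact numbers (and dropping the extra states u, v) are assumptions.

```python
import numpy as np

# A tiny two-state HMM over symbols {a, c}; numbers loosely follow the slide's
# tables, but the exact values are assumptions for illustration.
states = ["s", "t"]
start = np.array([1.0, 0.0])                     # P(S1)
trans = np.array([[0.1, 0.9],                    # P(S'|S): rows = from-state s, t
                  [0.5, 0.5]])
emit = {"a": np.array([0.9, 0.6]),               # P(x=a | S) for S = s, t
        "c": np.array([0.1, 0.4])}               # P(x=c | S) for S = s, t

def state_posterior(obs):
    """P(S_n = k | obs) for every position n, via forward-backward."""
    T, K = len(obs), len(states)
    fwd = np.zeros((T, K)); bwd = np.ones((T, K))
    fwd[0] = start * emit[obs[0]]
    for n in range(1, T):
        fwd[n] = (fwd[n - 1] @ trans) * emit[obs[n]]
    for n in range(T - 2, -1, -1):
        bwd[n] = trans @ (emit[obs[n + 1]] * bwd[n + 1])
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)

post = state_posterior("aaca")
print(dict(zip(states, post[1])))   # Pr(S2 = s | aaca), Pr(S2 = t | aaca)
```

The posterior over S2 combines evidence from symbols both before and after position 2, which is exactly why inference runs against as well as along the arrows.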

Some More Details on Directed Models. The example police say we're in violation: insufficient use of the "Monty Hall" problem, and discussing Bayes nets without discussing burglar alarms.

The (highly practical) Monty Hall problem. You're in a game show. Behind one door is a car; behind the others, goats. You pick one of three doors, say #1. The host, Monty Hall, opens one door, revealing... a goat! You now can either stick with your guess or change doors.

[Bayes net: A = first guess, B = the money (car location), C = the revealed goat, D = stick or swap, E = second guess. CPTs: P(A) and P(B) are uniform over doors 1-3 (0.33 each); P(D) is 0.5 stick / 0.5 swap; the CPT for P(C|A,B) is shown only in part.]

A few minutes later, the goat from behind door C drives away in the car.

The (highly practical) Monty Hall problem.

[The same network: A = first guess, B = the money, C = the revealed goat, D = stick or swap, E = second guess; now with a CPT for the second guess, P(E|A,C,D), added alongside P(A), P(B), P(C|A,B), and P(D).]

The (highly practical) Monty Hall problem. We could construct the joint and compute P(E=B | D=swap)... again by the chain rule:

P(A,B,C,D,E) = P(E|A,C,D) * P(D) * P(C|A,B) * P(B) * P(A)

The (highly practical) Monty Hall problem. The joint table has...? 3*3*3*2*3 = 162 rows. The conditional probability tables (CPTs) shown have...? 3 + 3 + 3*3*3 + 2*3*3 = 51 rows < 162 rows. Big questions:
- Why are the CPTs smaller?
- How much smaller are the CPTs than the joint?
- Can we compute the answers to queries like P(E=B|d) without building the joint probability table, just using the CPTs?
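As a concrete baseline for the last question, here is a minimal brute-force sketch that answers P(E=B | D) by enumerating the joint from the CPTs. The full P(C|A,B) table and the deterministic second-guess rule are assumptions based on the usual Monty Hall rules (the slides show those CPTs only in part), and D is conditioned on directly since it is independent of A, B, and C.

```python
from itertools import product

DOORS = [1, 2, 3]

def p_C_given_AB(c, a, b):
    """Monty opens door c: never the guessed door a, never the money door b;
    ties broken uniformly. (Assumed full CPT; the slide shows it only in part.)"""
    valid = [d for d in DOORS if d != a and d != b]
    return 1.0 / len(valid) if c in valid else 0.0

def second_guess(a, c, d):
    """E is deterministic: stick keeps a, swap takes the one unopened other door."""
    return a if d == "Stick" else next(x for x in DOORS if x not in (a, c))

def p_win(policy):
    """P(E = B | D = policy) by brute-force enumeration of the joint."""
    num = den = 0.0
    for a, b, c in product(DOORS, DOORS, DOORS):
        pr = (1 / 3) * (1 / 3) * p_C_given_AB(c, a, b)   # P(A) * P(B) * P(C|A,B)
        den += pr
        if second_guess(a, c, policy) == b:
            num += pr
    return num / den

print(p_win("Stick"), p_win("Swap"))   # 1/3 vs 2/3
```

Sticking wins the money 1/3 of the time and swapping wins 2/3 of the time, which is the classic answer.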

The (highly practical) Monty Hall problem. Why is the CPT representation smaller? Follow the money! (B) E is conditionally independent of B given A, C, and D.

Conditional Independence (again). Definition: R and L are conditionally independent given M if for all x, y, z in {T, F}:

P(R=x | M=y ∧ L=z) = P(R=x | M=y)

More generally: let S1, S2, and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments ∧ S3's assignments) = P(S1's assignments | S3's assignments)
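A quick numeric sketch of this definition: build a tiny joint that factors as P(M) P(R|M) P(L|M), with made-up numbers, and verify that P(R | M, L) = P(R | M) for every assignment.

```python
from itertools import product

# Made-up numbers; the factorization guarantees R and L are independent given M.
p_m = {True: 0.3, False: 0.7}
p_r_given_m = {True: 0.9, False: 0.2}   # P(R=True | M)
p_l_given_m = {True: 0.6, False: 0.1}   # P(L=True | M)

def joint(m, r, l):
    pr = p_r_given_m[m] if r else 1 - p_r_given_m[m]
    pl = p_l_given_m[m] if l else 1 - p_l_given_m[m]
    return p_m[m] * pr * pl

def cond(r, m, l=None):
    """P(R=r | M=m) if l is None, else P(R=r | M=m, L=l), by summing the joint."""
    ls = [l] if l is not None else [True, False]
    num = sum(joint(m, r, x) for x in ls)
    den = sum(joint(m, rr, x) for rr in [True, False] for x in ls)
    return num / den

for m, l in product([True, False], repeat=2):
    assert abs(cond(True, m, l) - cond(True, m)) < 1e-12   # P(R|M,L) = P(R|M)
print("conditional independence verified numerically")
```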

The (highly practical) Monty Hall problem. (A = first guess, B = the money, C = the goat, D = stick or swap, E = second guess.) What are the conditional independencies? I<A, {B}, C>? I<A, {C}, B>? I<E, {A,C}, B>? I<D, {E}, B>? ...

What Independencies does a Bayes Net Model? In order for a Bayesian network to model a probability distribution, the following must be true by definition: each variable is conditionally independent of all its non-descendants in the graph given the value of all its parents. This implies the factorization P(X1,...,Xn) = Πi P(Xi | parents(Xi)). But what else does it imply?

What Independencies does a Bayes Net Model? Example: Given Y, does learning the value of Z tell us nothing at all new about X? I.e., is P(X|Y, Z) equal to P(X | Y)? Yes. Since we know the value of all of X’s parents (namely, Y), and Z is not a descendant of X, X is conditionally independent of Z. Also, since independence is symmetric, P(Z|Y, X) = P(Z|Y). Z Y X

What Independencies does a Bayes Net Model? Let I<X,Y,Z> represent X and Z being conditionally independent given Y. I<X,Y,Z>? Yes, just as in the previous example: all of X's parents are given, and Z is not a descendant.

Things get a little more confusing. X has no parents, so we know all its parents' values trivially. Z is not a descendant of X. So I<X,{},Z>, even though there's an undirected path from X to Z through an unknown variable Y. What if we do know the value of Y, though? Or one of its descendants? [Graph: X → Y ← Z.]

The "Burglar Alarm" example. Your house has a twitchy burglar alarm that is also sometimes triggered by earthquakes. The Earth arguably doesn't care whether your house is currently being burgled. While you are on vacation, one of your neighbors calls and tells you your home's burglar alarm is ringing. Uh oh! [Network: Burglar → Alarm ← Earthquake, Alarm → Phone Call.]

Things get a lot more confusing. But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake "explains away" the hypothetical burglar. But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!

"Explaining away". [Diagram: X → E ← Y.] This is "explaining away": E is a common symptom of two causes, X and Y. After observing E=1, both X and Y become more probable. After observing E=1 and X=1, Y becomes less probable, since X alone is enough to "explain" E.
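A small numeric illustration of this pattern, with a made-up noisy-OR style CPT for the X → E ← Y structure (none of these numbers come from the slides):

```python
from itertools import product

# Made-up CPTs: rare causes X and Y, each of which triggers E with prob 0.8,
# plus a small leak probability of 0.05.
p_x1, p_y1 = 0.1, 0.1
def p_e1(x, y):
    return 1 - (1 - 0.8 * x) * (1 - 0.8 * y) * (1 - 0.05)

def prob(query, **evidence):
    """P(query variable = 1 | evidence) by enumerating the tiny joint."""
    num = den = 0.0
    for x, y, e in product([0, 1], repeat=3):
        pr = (p_x1 if x else 1 - p_x1) * (p_y1 if y else 1 - p_y1) \
             * (p_e1(x, y) if e else 1 - p_e1(x, y))
        vals = {"X": x, "Y": y, "E": e}
        if all(vals[k] == v for k, v in evidence.items()):
            den += pr
            if vals[query] == 1:
                num += pr
    return num / den

print(prob("Y"))                  # prior belief in Y
print(prob("Y", E=1))             # goes up after observing the symptom
print(prob("Y", E=1, X=1))        # drops back down once X explains E
```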

"Explaining away" and common sense. Historical note: classical logic is monotonic: the more you know, the more you deduce. "Common-sense" reasoning is not monotonic: birds fly, but not after being cooked for 20 min/lb at 350°F. This led to numerous "non-monotonic logics" for AI. This example shows that Bayes nets are not monotonic either: if P(Y|E) is "your belief" in Y after observing E, and P(Y|X,E) is "your belief" in Y after observing E and X, then your belief in Y can decrease after you discover X.

How can I make this less confusing? But now suppose you learn that there was a medium-sized earthquake in your neighborhood. Oh, whew! Probably not a burglar after all. Earthquake "explains away" the hypothetical burglar. But then it must not be the case that I<Burglar, {Phone Call}, Earthquake>, even though I<Burglar, {}, Earthquake>!

d-separation to the rescue. Fortunately, there is a relatively simple algorithm for determining whether two variables in a Bayesian network are conditionally independent: d-separation. Definition: X and Z are d-separated by a set of evidence variables E iff every undirected path from X to Z is "blocked", where a path is "blocked" iff one or more of the following conditions is true: ... (i.e., X and Z are possibly dependent iff there exists an unblocked path).

A path is "blocked" when...
- There exists a variable V on the path such that it is in the evidence set E, and the arcs putting V in the path are "tail-to-tail" (unknown "common causes" of X and Z impose dependency).
- Or, there exists a variable V on the path such that it is in the evidence set E, and the arcs putting V in the path are "tail-to-head" (unknown "causal chains" connecting X and Z impose dependency).
- Or, ...

A path is "blocked" when... (the funky case)
- ... Or, there exists a variable V on the path such that it is NOT in the evidence set E, neither are any of its descendants, and the arcs putting V in the path are "head-to-head" (known "common symptoms" of X and Z impose dependencies... X may "explain away" Z).

Summary: d-separation. There are three ways paths from X to Y given evidence E can be blocked. X is d-separated from Y given E iff all paths from X to Y given E are blocked. If X is d-separated from Y given E, then I<X,E,Y>. [Figure: the three blocking configurations, each with a blocking node Z between X and Y.]
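For small graphs, the blocking rules above can be applied directly by enumerating undirected paths. The sketch below does exactly that; it is not the efficient reachability ("Bayes ball") algorithm used in practice, and the burglar-alarm test case at the end reuses the example from the earlier slides.

```python
def descendants(node, children):
    """All descendants of node (children maps node -> list of its children)."""
    out, stack = set(), [node]
    while stack:
        for c in children.get(stack.pop(), []):
            if c not in out:
                out.add(c); stack.append(c)
    return out

def all_undirected_paths(x, y, edges):
    """Simple paths from x to y ignoring edge direction (fine for small graphs)."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    paths, stack = [], [[x]]
    while stack:
        p = stack.pop()
        for n in nbrs.get(p[-1], ()):
            if n == y:
                paths.append(p + [n])
            elif n not in p:
                stack.append(p + [n])
    return paths

def d_separated(x, y, evidence, edges):
    """True iff every undirected path from x to y is blocked given the evidence."""
    children, edge_set = {}, set(edges)
    for a, b in edges:                   # edges are directed (parent, child) pairs
        children.setdefault(a, []).append(b)
    for path in all_undirected_paths(x, y, edges):
        blocked = False
        for prev, v, nxt in zip(path, path[1:], path[2:]):
            collider = (prev, v) in edge_set and (nxt, v) in edge_set  # head-to-head at v
            if collider:
                if v not in evidence and not (descendants(v, children) & evidence):
                    blocked = True; break
            elif v in evidence:          # tail-to-tail or tail-to-head through observed v
                blocked = True; break
        if not blocked:
            return False
    return True

# Burglar-alarm network: Burglar -> Alarm <- Earthquake, Alarm -> PhoneCall
edges = [("Burglar", "Alarm"), ("Earthquake", "Alarm"), ("Alarm", "PhoneCall")]
print(d_separated("Burglar", "Earthquake", set(), edges))          # True:  I<Burglar, {}, Earthquake>
print(d_separated("Burglar", "Earthquake", {"PhoneCall"}, edges))  # False: explaining away
```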

Learning for Bayes Nets

(Review) Breaking it down: learning parameters for the "naïve" HMM. Training data defines a unique path through the HMM!
- Transition probabilities: probability of transitioning from state i to state j = (number of transitions from i to j) / (total transitions from state i).
- Emission probabilities: probability of emitting symbol k from state i = (number of times k generated from i) / (number of transitions from i).
...with smoothing, of course.
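A minimal counting-based sketch of these estimates. The add-alpha smoothing and the data format (sequences of (state, symbol) pairs) are assumptions; the slides only say "with smoothing, of course".

```python
from collections import defaultdict

def train_hmm(labeled_seqs, alpha=1.0):
    """Count-based estimates for an HMM from fully labeled sequences.
    alpha is an add-alpha smoothing constant (an assumption)."""
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    for seq in labeled_seqs:                       # seq = [(state, symbol), ...]
        for (s, _), (s_next, _) in zip(seq, seq[1:]):
            trans[s][s_next] += 1                  # transition counts
        for s, x in seq:
            emit[s][x] += 1                        # emission counts
    states = set(trans) | set(emit) | {s2 for d in trans.values() for s2 in d}
    symbols = {x for d in emit.values() for x in d}
    def normalize(counts, keys):
        z = sum(counts[k] + alpha for k in keys)
        return {k: (counts[k] + alpha) / z for k in keys}
    P_trans = {s: normalize(trans[s], states) for s in states}
    P_emit = {s: normalize(emit[s], symbols) for s in states}
    return P_trans, P_emit

# toy usage: two labeled sequences over states {s, t} and symbols {a, c}
data = [[("s", "a"), ("s", "a"), ("t", "c")], [("s", "a"), ("t", "c"), ("t", "c")]]
P_trans, P_emit = train_hmm(data)
print(P_trans["s"]["t"], P_emit["t"]["c"])
```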

(Review) Breaking it down: NER using the "naïve" HMM.
- Define the HMM structure: one state per entity type.
- Training data defines a unique path through the HMM for each labeled example; use this to estimate transition and emission probabilities.
- At test time, for a sequence x: use Viterbi to find the sequence of states s that maximizes Pr(s|x), and use s to derive labels for the sequence x.
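A compact Viterbi sketch in this spirit, in log space to avoid underflow. The toy states, vocabulary, and probabilities are made up for illustration, not taken from any slide.

```python
import numpy as np

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the observations.
    start_p[s], trans_p[s][s2], emit_p[s][x] are dicts of probabilities."""
    V = [{s: np.log(start_p[s]) + np.log(emit_p[s][obs[0]]) for s in states}]
    back = []
    for x in obs[1:]:
        scores, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda sp: V[-1][sp] + np.log(trans_p[sp][s]))
            scores[s] = V[-1][best_prev] + np.log(trans_p[best_prev][s]) + np.log(emit_p[s][x])
            ptr[s] = best_prev
        V.append(scores)
        back.append(ptr)
    # trace back from the best final state
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# toy tagger: states are entity types, observations are tokens (numbers made up)
states = ["PER", "O"]
start = {"PER": 0.3, "O": 0.7}
trans = {"PER": {"PER": 0.6, "O": 0.4}, "O": {"PER": 0.2, "O": 0.8}}
emit = {"PER": {"smith": 0.7, "the": 0.3}, "O": {"smith": 0.1, "the": 0.9}}
print(viterbi(["the", "smith"], states, start, trans, emit))
```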

Learning for Bayes nets ~ learning for HMMs, if everything is observed.
Input: a sample of the joint, and the graph structure over the variables (for i=1,...,N you know Xi and parents(Xi)).
Output: estimated CPTs.
Learning method (discrete variables): estimate each CPT independently, using an MLE or MAP estimate.
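A sketch of that counting step for fully observed data: estimate each CPT independently by (optionally smoothed) relative frequencies. The data format (a list of dicts) and the add-alpha option are assumptions.

```python
from collections import Counter, defaultdict
from itertools import product

def estimate_cpts(data, parents, alpha=0.0):
    """MLE (alpha=0) or add-alpha MAP-style (alpha>0) estimates of each CPT.
    data: list of fully observed samples, each a dict {variable: value}.
    parents: maps each variable to a tuple of its parent variables."""
    values = defaultdict(set)
    for row in data:
        for var, val in row.items():
            values[var].add(val)
    cpts = {}
    for var, pa in parents.items():
        counts = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        cpt = {}
        for pa_vals in product(*(sorted(values[p]) for p in pa)):
            z = sum(counts[(pa_vals, v)] + alpha for v in values[var])
            for v in sorted(values[var]):
                cpt[(pa_vals, v)] = (counts[(pa_vals, v)] + alpha) / z if z else 0.0
        cpts[var] = cpt
    return cpts

# toy usage on a two-node net A -> B
data = [{"A": 1, "B": 1}, {"A": 1, "B": 0}, {"A": 0, "B": 0}]
cpts = estimate_cpts(data, parents={"A": (), "B": ("A",)}, alpha=1.0)
print(cpts["B"][((1,), 1)])   # smoothed estimate of P(B=1 | A=1)
```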

Learning for Bayes nets ~ learning for HMMs, if some things are not observed.
Input: a sample of the joint, and the graph structure over the variables (for i=1,...,N you know Xi and parents(Xi)).
Output: estimated CPTs.
Learning method (discrete variables): use inference* to estimate the distribution of the unobserved values, then use EM.
* The HMM methods generalize to trees. I'll talk about Gibbs sampling soon.

LDA and Other Directed Models for Modeling Text

Supervised Multinomial Naïve Bayes. Naïve Bayes model: compact representation. [Plate diagram: the expanded graph, with a class node C pointing to words W1, W2, ..., WN in each of M documents, equals the plate version, with a single W node inside a plate of size N, nested in a plate of size M, plus the word-multinomial parameters β.]

Supervised Multinomial Naïve Bayes. Naïve Bayes model: compact representation. [Same plate diagram, now making the K per-class word multinomials β1..βK explicit as a plate of size K.]

Review – supervised Naïve Bayes. Multinomial Naïve Bayes:
- For each class k = 1..K: construct a multinomial βk over words.
- For each document d = 1,...,M: generate Cd ~ Mult(· | π).
- For each position n = 1,...,Nd: generate wn ~ Mult(· | β, Cd) ... or if you prefer, wn ~ Pr(w | Cd).
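A short sketch of this generative story as sampling code; the class prior, word distributions, and document lengths below are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_nb_corpus(pi, beta, doc_lengths):
    """Sample a corpus from the multinomial naive Bayes generative story.
    pi: class prior (K,); beta: per-class word distributions (K, V);
    doc_lengths: N_d for each document. All numbers here are made up."""
    docs, labels = [], []
    for n_d in doc_lengths:
        c = rng.choice(len(pi), p=pi)                            # C_d ~ Mult(. | pi)
        words = rng.choice(beta.shape[1], size=n_d, p=beta[c])   # w_n ~ Mult(. | beta, C_d)
        docs.append(words.tolist()); labels.append(c)
    return docs, labels

pi = np.array([0.6, 0.4])                 # K = 2 classes
beta = np.array([[0.7, 0.2, 0.1],         # V = 3 word types
                 [0.1, 0.2, 0.7]])
docs, labels = generate_nb_corpus(pi, beta, doc_lengths=[5, 5, 5])
print(labels, docs[0])
```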

Review – unsupervised Naïve Bayes. Mixture model: EM solution.
- E-step: compute the posterior over each document's hidden class, P(Zd = k | wd), which is proportional to πk times the product of βk,w over the words w in d.
- M-step: re-estimate π and β from the resulting expected counts.
Key capability: estimate the distribution of latent variables given observed variables.

Review – unsupervised Naïve Bayes. Mixture model: the unsupervised naïve Bayes model. Joint probability of words and classes: P(Cd, wd) = πCd · Πn βCd,wdn. But classes are not visible, so the class becomes a latent variable Z and we must sum it out: P(wd) = Σk πk Πn βk,wdn.
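A bare-bones EM sketch for this mixture-of-multinomials model, operating on a document-term count matrix. The initialization, tiny smoothing constant, and fixed iteration count are assumptions; a real implementation would also monitor the log likelihood for convergence.

```python
import numpy as np

def em_mixture_multinomial(X, K, iters=50, seed=0):
    """EM for the unsupervised naive Bayes (mixture of multinomials) model.
    X is an (M, V) matrix of word counts per document."""
    rng = np.random.default_rng(seed)
    M, V = X.shape
    pi = np.full(K, 1.0 / K)
    beta = rng.dirichlet(np.ones(V), size=K)          # (K, V) word distributions
    for _ in range(iters):
        # E-step: responsibilities gamma[d, k] = P(Z_d = k | w_d)
        log_post = np.log(pi + 1e-12) + X @ np.log(beta).T    # (M, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi and beta from expected counts
        pi = gamma.sum(axis=0) / M
        beta = gamma.T @ X + 1e-6                     # tiny smoothing for stability
        beta /= beta.sum(axis=1, keepdims=True)
    return pi, beta, gamma

# toy corpus: 4 documents over a 3-word vocabulary
X = np.array([[5, 1, 0], [4, 2, 0], [0, 1, 5], [0, 2, 4]])
pi, beta, gamma = em_mixture_multinomial(X, K=2)
print(np.round(gamma, 2))
```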

Beyond Naïve Bayes: Probabilistic Latent Semantic Indexing (PLSI). Every document is a mixture of topics.
- For i = 1...K: let βi be a multinomial over words.
- For each document d: let θd be a distribution over {1,..,K}.
- For each word position in d: pick a topic z from θd, then pick a word w from βz.
Turns out to be hard to fit: lots of parameters! Also: only applies to the training data.

The LDA Topic Model

LDA Motivation. Assumptions: (1) documents are i.i.d.; (2) within a document, words are i.i.d. (bag of words).
- For each document d = 1,...,M: generate θd ~ D1(...).
- For each word position n = 1,...,Nd: generate wn ~ D2(· | θd).
Now pick your favorite distributions for D1, D2.

Latent Dirichlet Allocation: "mixed membership".
- For each document d = 1,...,M: generate θd ~ Dir(· | α).
- For each position n = 1,...,Nd: generate zn ~ Mult(· | θd), then generate wn ~ Mult(· | φzn).
[Plate diagram: α → θ → z → w, with the K topic multinomials φ1..φK (hyperparameter β) feeding into w.]
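A short sketch of the LDA generative story as sampling code, using numpy's Dirichlet and categorical samplers; the sizes and hyperparameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_lda_corpus(alpha, phi, doc_lengths):
    """Sample documents from the LDA generative story.
    alpha: Dirichlet parameter (K,); phi: topic-word distributions (K, V)."""
    docs = []
    for n_d in doc_lengths:
        theta = rng.dirichlet(alpha)                      # theta_d ~ Dir(alpha)
        z = rng.choice(len(alpha), size=n_d, p=theta)     # z_n ~ Mult(theta_d)
        words = [rng.choice(phi.shape[1], p=phi[k]) for k in z]  # w_n ~ Mult(phi_{z_n})
        docs.append(words)
    return docs

K, V = 3, 10
phi = rng.dirichlet(np.ones(V), size=K)       # K topics over a V-word vocabulary
docs = generate_lda_corpus(alpha=np.full(K, 0.1), phi=phi, doc_lengths=[8, 8])
print(docs)
```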

LDA’s view of a document

LDA topics

Review - LDA. Latent Dirichlet Allocation, parameter learning:
- Variational EM: numerical approximation using lower bounds; results in biased solutions; convergence has numerical guarantees.
- Gibbs sampling: stochastic simulation; unbiased solutions; stochastic convergence.

Review - LDA. Gibbs sampling works for any directed model! It is applicable when the joint distribution is hard to evaluate but the conditional distributions are known. The sequence of samples comprises a Markov chain, and the stationary distribution of the chain is the joint distribution. Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.

Why does Gibbs sampling work?
- What's the fixed point? The stationary distribution of the chain is the joint distribution.
- When will it converge (in the limit)? If the graph defined by the chain is connected.
- How long will it take to converge? Depends on the second eigenvalue of that graph's transition matrix (the spectral gap).

Called "collapsed Gibbs sampling", since you've marginalized away some variables. From: Parameter Estimation for Text Analysis, Gregor Heinrich.

LDA: Latent Dirichlet Allocation, "mixed membership".
- Randomly initialize each zm,n.
- Repeat for t = 1,...: for each doc m and word n, find Pr(zmn = k | other z's) and sample zmn according to that distribution.

Even More Detail on LDA…

Way way more detail

More detail

What gets learned…

In a math-ier notation: the sampler works with per-document topic counts N[d,k], corpus-wide word-topic counts (W[w,k] in the pseudocode below), their marginal totals such as N[*,k] and N[*,*], and the vocabulary size V.

Initialization:
for each document d and word position j in d:
    z[d,j] = k, a random topic
    N[d,k]++
    W[w,k]++   (where w = id of the j-th word in d)

Sampling passes:
for each pass t = 1, 2, ...:
    for each document d and word position j in d:
        z[d,j] = k, a new topic sampled from Pr(z[d,j] = k | all the other z's)
        update N and W to reflect the new assignment of z:
            N[d,k]++;  N[d,k']--   (where k' is the old z[d,j])
            W[w,k]++;  W[w,k']--   (where w is w[d,j])
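A runnable sketch of this collapsed Gibbs sampler, keeping exactly the N[d,k] and W[w,k] counts from the pseudocode. The sampling distribution uses the standard collapsed update Pr(z = k | rest) ∝ (N[d,k] + α)(W[w,k] + β)/(Σw W[w,k] + Vβ); the toy corpus, hyperparameters, and number of passes are assumptions.

```python
import numpy as np

def collapsed_gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, passes=200, seed=0):
    """Compact collapsed Gibbs sampler for LDA, following the count-based
    pseudocode above. docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    N = np.zeros((len(docs), K))          # N[d,k]: words in doc d assigned topic k
    W = np.zeros((V, K))                  # W[w,k]: word w assigned topic k (corpus-wide)
    Wsum = np.zeros(K)                    # column totals of W
    z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initialization
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            k = z[d][j]; N[d, k] += 1; W[w, k] += 1; Wsum[k] += 1
    for _ in range(passes):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k_old = z[d][j]           # remove the current assignment from the counts
                N[d, k_old] -= 1; W[w, k_old] -= 1; Wsum[k_old] -= 1
                p = (N[d] + alpha) * (W[w] + beta) / (Wsum + V * beta)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][j] = k_new           # record the new assignment in the counts
                N[d, k_new] += 1; W[w, k_new] += 1; Wsum[k_new] += 1
    return z, N, W

# toy corpus: word ids in a vocabulary of size 6
docs = [[0, 1, 0, 2], [1, 0, 1, 2], [3, 4, 5, 4], [4, 5, 3, 3]]
z, N, W = collapsed_gibbs_lda(docs, K=2, V=6)
print(N)   # per-document topic counts after sampling
```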

Some comments on LDA: it is a very widely used model, and also a component of many other models.