Bayesian Generative Modeling

Presentation transcript:

Bayesian Generative Modeling Jason Eisner Summer School on Machine Learning Lisbon, Portugal – July 2011

Bayesian Generative Modeling what’s a model? Jason Eisner Summer School on Machine Learning Lisbon, Portugal – July 2011

Bayesian Generative Modeling what’s a generative model? Jason Eisner Summer School on Machine Learning Lisbon, Portugal – July 2011

Bayesian Generative Modeling what’s Bayesian? Jason Eisner Summer School on Machine Learning Lisbon, Portugal – July 2011

Task-centric view of the world x y evaluation (loss function) e.g., p(y|x) model and decoder

Task-centric view of the world x y loss function p(y|x) model Great way to track progress & compare systems But may fracture us into subcommunities (our systems are incomparable & my semantics != your semantics) Room for all of AI when solving any NLP task Spelling correction could get some benefit from deep semantics, unsupervised grammar induction, active learning, discourse, etc. But in practice, focus on raising a single performance number Within strict, fixed assumptions about the type of available data Do we want to build models & algs that are good for just one task? Spelling correction: Humans do all kinds of reasoning about what was really meant in this context; they can even tell whether the misspelling was deliberate (pun, persona, implicit quotation, other metalinguistic behavior). But as an engineering matter, this doesn’t help much? Do we want to build models & algs that are good for just one task? There are some tasks that are genuinely worth solving as formulated, at least for now. Others are set up as academic exercises to motivate research, and those may motivate peculiar research directions (e.g., RTE using n-grams). Either way, when the task parameters change, a lot of effort falls on the floor. Of course it keeps us busy always building the next system; but we may be missing opportunities for fundamental progress and reuse. Surely humans do tune to particular tasks, for both speed and accuracy, and I’ll say more about that at the end. But that is perhaps on top of an architecture that also lets them puzzle out the answers to new tasks without training from scratch.

Variable-centric view of the world Focus on what variables might help us understand and produce language. Some are token variables (utterance-specific), others are type variables (language-specific). Might be observed or latent. How those variables are represented, what the priors are, and (next slide) how they interrelate. When we deeply understand language, what representations (type and token) does that understanding comprise?

Bayesian View of the World observed data probability distribution hidden data

Different tasks merely change which variables are observed and which ones you care about inferring. (Table over the tasks comprehension, production, and learning; the variables are sentence, syntax tree, semantics, facts about speaker/world, and facts about the language, each marked as observed, queried (?), or latent depending on the task.)

Different tasks merely change which variables are observed and which ones you care about inferring. (Table over the tasks comprehension, production, and learning; the variables are surface form of word, surface-to-underlying alignment, underlying form of word, abstract morphemes in word, underlying form of morphemes (lexicon), and constraint ranking (grammar), each marked as observed, queried (?), or latent depending on the task.)

Different tasks merely change which variables are observed and which ones you care about inferring. (Table over the tasks MT decoding, MT training, and cross-lingual projection; the variables are Chinese sentence, Chinese parse, English parse, English sentence, and translation & language models, each marked as observed, queried (?), or latent depending on the task.)

All you need is “p” Science = a descriptive theory of the world Write down a formula for p(everything) everything = observed ∪ needed ∪ latent Given observed, what might needed be? Most probable settings of needed are those that give comparatively large values of ∑_latent p(observed, needed, latent) Formally, we want p(needed | observed) = p(observed, needed) / p(observed) Since observed is constant, the conditional probability of needed varies with p(observed, needed), which is given above (What do we do then?)
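
A minimal sketch of this slide's recipe, assuming a toy joint over three binary variables (the table values below are made up): sum out latent, then renormalize by p(observed) to get p(needed | observed).

import itertools
from collections import defaultdict

# Toy joint p(observed, needed, latent) over three binary variables.
# The eight probabilities are made up for illustration and sum to 1.
vals = [0.10, 0.05, 0.20, 0.05, 0.15, 0.10, 0.25, 0.10]
joint = dict(zip(itertools.product([0, 1], repeat=3), vals))

def posterior_needed(observed_value):
    """p(needed | observed) = sum_latent p(obs, needed, latent) / p(obs)."""
    unnorm = defaultdict(float)
    for (o, n, l), p in joint.items():
        if o == observed_value:
            unnorm[n] += p          # marginalize out the latent variable
    z = sum(unnorm.values())        # = p(observed), a constant for a fixed observation
    return {n: p / z for n, p in unnorm.items()}

print(posterior_needed(1))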

All you need is “p” Science = a descriptive theory of the world Write down a formula for p(everything) everything = observed  needed  latent p can be any non-negative function you care to design (as long as it sums to 1) (or another finite positive number: just rescale) But it’s often convenient to use a graphical model Flexible modeling technique Well understood We know how to (approximately) compute with them

Graphical model notation slide thanks to Zoubin Ghahramani

Factor graphs slide thanks to Zoubin Ghahramani

Rather basic NLP example First, a familiar example Conditional Random Field (CRF) for POS tagging Possible tagging (i.e., assignment to remaining variables) … v v v … find preferred tags Observed input sentence (shaded)

Rather basic NLP example First, a familiar example Conditional Random Field (CRF) for POS tagging Possible tagging (i.e., assignment to remaining variables) Another possible tagging … v a n … find preferred tags Observed input sentence (shaded)

Conditional Random Field (CRF) “Binary” factor that measures compatibility of 2 adjacent tags (a table over the tag pairs v, n, a with example values such as 2, 1, 3). Model reuses same parameters at this position … find preferred tags

Conditional Random Field (CRF) Unary factors measure the compatibility of each word with each tag (e.g., for one word: v 0.3, n 0.02, a …; a zero entry means that word can’t be an adjective). … find preferred tags

Conditional Random Field (CRF) p(v a n) is proportional to the product of all factors’ values on v a n (the binary tag-pair table and the unary word–tag tables from the previous slides). … find preferred tags

Conditional Random Field (CRF) p(v a n) is proportional to the product of all factors’ values on v a n = … 1 * 3 * 0.3 * 0.1 * 0.2 … … find preferred tags MRF vs. CRF?

Inference: What do you know how to compute with this model? p(v a n) is proportional to the product of all factors’ values on v a n = … 1 * 3 * 0.3 * 0.1 * 0.2 … … find preferred tags Maximize, sample, sum …
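
A brute-force sketch of those three operations on a tiny chain CRF. The unary and binary factor tables below are made up (they only loosely echo the numbers on the slides), and enumeration over all taggings stands in for the usual dynamic-programming algorithms.

import itertools, random

TAGS = ["v", "n", "a"]
WORDS = ["find", "preferred", "tags"]

# Made-up factor tables for illustration (not the slides' exact numbers).
unary = {  # unary[word][tag] = compatibility of word with tag
    "find":      {"v": 0.3, "n": 0.02, "a": 0.0},   # zero entry: "find" can't be an adjective
    "preferred": {"v": 0.3, "n": 0.1,  "a": 0.1},
    "tags":      {"v": 0.2, "n": 0.3,  "a": 0.1},
}
binary = {  # binary[(prev, cur)] = compatibility of two adjacent tags
    pair: w for pair, w in zip(itertools.product(TAGS, TAGS),
                               [2, 1, 3, 2, 1, 3, 2, 1, 3])
}

def score(tagging):
    """Unnormalized probability: product of all unary and binary factor values."""
    s = 1.0
    for word, tag in zip(WORDS, tagging):
        s *= unary[word][tag]
    for prev, cur in zip(tagging, tagging[1:]):
        s *= binary[(prev, cur)]
    return s

taggings = list(itertools.product(TAGS, repeat=len(WORDS)))
Z = sum(score(t) for t in taggings)                  # sum (partition function)
best = max(taggings, key=score)                      # maximize (best tagging)
sample = random.choices(taggings,                    # sample a tagging ~ p
                        weights=[score(t) for t in taggings])[0]

print("best:", best, "p =", score(best) / Z, "sample:", sample)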

Variable-centric view of the world Focus on what variables might help us understand and produce language. Some are token variables (utterance-specific), others are type variables (language-specific). Might be observed or latent. How those variables are represented, what the priors are, and (next slide) how they interrelate. When we deeply understand language, what representations (type and token) does that understanding comprise?

To recover variables, model and exploit their correlations. (Diagram of interrelated variables: semantics, lexicon (word types), inflection, cognates, transliteration, abbreviation, neologism, language evolution, entailment, correlation, tokens, sentences, translation, alignment, editing, quotation, speech, misspellings/typos, formatting, entanglement, annotation, discourse context, resources.) I don’t want to learn to do tasks but rather learn to understand what’s going on in the language I see; happy to take help from direct or indirect supervision or other resources.

How do you design the factors? It’s easy to connect “English sentence” to “Portuguese sentence” … … but you have to design a specific function that measures how compatible a pair of sentences is. Often, you can think of a generative story in which the individual factors are themselves probabilities. May require some latent variables.

Directed graphical models (Bayes nets) Under any model (chain rule): p(A, B, C, D, E) = p(A) p(B|A) p(C|A,B) p(D|A,B,C) p(E|A,B,C,D) The model above (the graph) drops some of those dependencies: each factor conditions only on that node’s parents, i.e., p(everything) = ∏ p(node | parents(node)). slide thanks to Zoubin Ghahramani (modified)
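
A minimal sketch of that parent factorization. The graph and the conditional probability tables below are assumed examples (not necessarily the graph on the slide); the point is that multiplying one p(node | parents) per node yields a proper joint distribution.

import itertools

# Assumed example graph: A -> B, A -> C, {B, C} -> D, C -> E.  All variables binary.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}

# Toy conditional probability tables: cpt[node][parent_values] = p(node = 1 | parents).
cpt = {
    "A": {(): 0.6},
    "B": {(0,): 0.3, (1,): 0.8},
    "C": {(0,): 0.5, (1,): 0.2},
    "D": {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.7, (1, 1): 0.9},
    "E": {(0,): 0.25, (1,): 0.75},
}

def joint(assign):
    """p(A,B,C,D,E) = product over nodes of p(node | parents(node))."""
    p = 1.0
    for node, pa in parents.items():
        p1 = cpt[node][tuple(assign[x] for x in pa)]
        p *= p1 if assign[node] == 1 else 1.0 - p1
    return p

# Sanity check: the factorization defines a proper distribution (sums to 1).
total = sum(joint(dict(zip("ABCDE", vals)))
            for vals in itertools.product([0, 1], repeat=5))
print(round(total, 10))   # 1.0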

Unigram model for generating text … w1 w2 w3 … p(w1) · p(w2) · p(w3) · …

Explicitly show model’s parameters θ “θ is a vector that says which unigrams are likely” … θ → w1 w2 w3 … p(θ) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …
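
A minimal sketch of this model for a fixed θ, using a made-up five-word vocabulary: sample words i.i.d. from θ, and score a word sequence by the product p(w1 | θ) · p(w2 | θ) · … (computed in log space).

import math, random

# Made-up unigram parameters theta over a toy five-word vocabulary.
theta = {"the": 0.4, "cat": 0.2, "dog": 0.2, "finds": 0.1, "tags": 0.1}

def generate(n):
    """Sample n words i.i.d. from the unigram distribution theta."""
    words, probs = zip(*theta.items())
    return random.choices(words, weights=probs, k=n)

def log_prob(words):
    """log p(w1 | theta) + log p(w2 | theta) + ... (the slide's product, in log space)."""
    return sum(math.log(theta[w]) for w in words)

sentence = generate(5)
print(sentence, log_prob(sentence))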

“Plate notation” simplifies diagram “θ is a vector that says which unigrams are likely” θ → w [plate of size N1] p(θ) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …

Learn θ from observed words (rather than vice-versa) p(θ) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …

Explicitly show prior over θ (e.g., Dirichlet) “Even if we didn’t observe word 5, the prior says that θ5 = 0 is a terrible guess” α given, θ ~ Dirichlet(α), wi ~ θ α → θ → w [plate of size N1] p(α) · p(θ | α) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …

Dirichlet Distribution Each point on a k-dimensional simplex is a multinomial probability distribution. (Figure: the simplex with corners labeled “the”, “dog”, “cat”.) slide thanks to Nigel Crook

Dirichlet Distribution A Dirichlet Distribution is a distribution over multinomial distributions θ in the simplex. (Figure: example densities over the simplex.) slide thanks to Nigel Crook

slide thanks to Percy Liang and Dan Klein

Dirichlet Distribution Example draws from a Dirichlet Distribution over the 3-simplex: Dirichlet(5,5,5) Dirichlet(0.2, 5, 0.2) Dirichlet(0.5,0.5,0.5) slide thanks to Nigel Crook
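
The same three concentration settings can be reproduced numerically; a small numpy sketch (the seed and the number of draws are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

# Draws over the 3-simplex for the slide's three concentration parameters.
for alpha in [(5, 5, 5), (0.2, 5, 0.2), (0.5, 0.5, 0.5)]:
    draws = rng.dirichlet(alpha, size=3)   # each row is a distribution that sums to 1
    print(alpha)
    print(draws.round(3))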

Explicitly show prior over θ (e.g., Dirichlet) Posterior distribution p(θ | α, w) is also a Dirichlet, just like the prior p(θ | α). “Even if we didn’t observe word 5, the prior says that θ5 = 0 is a terrible guess” prior = Dirichlet(α) → posterior = Dirichlet(α + counts(w)) Mean of posterior is like the max-likelihood estimate of θ, but smooth the corpus counts by adding “pseudocounts” α. (But better to use the whole posterior, not just the mean.) α → θ → w [plate of size N1] p(α) · p(θ | α) · p(w1 | θ) · p(w2 | θ) · p(w3 | θ) · …
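
A minimal sketch of that smoothing, with a made-up five-word vocabulary, a made-up seven-word corpus, and a symmetric pseudocount α = 0.5: the posterior mean adds α to each count, so unseen words no longer get probability 0.

from collections import Counter

# Symmetric Dirichlet prior: pseudocount alpha for every vocabulary word.
# The vocabulary, corpus, and alpha value are made up for illustration.
vocab = ["the", "cat", "dog", "finds", "tags"]
alpha = 0.5
corpus = "the cat finds the dog the cat".split()

counts = Counter(corpus)
# Posterior is Dirichlet(alpha + counts); its mean smooths the MLE with pseudocounts.
total = sum(counts[w] + alpha for w in vocab)
posterior_mean = {w: (counts[w] + alpha) / total for w in vocab}
mle = {w: counts[w] / len(corpus) for w in vocab}

print("MLE:           ", mle)             # unseen word "tags" gets probability 0
print("posterior mean:", posterior_mean)  # "tags" gets a small nonzero probability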

Training and Test Documents “Learn θ from document 1, use it to predict document 2” What do good configurations look like if N1 is large? What if N1 is small? (Diagram: one θ generates both the train plate of N1 words and the test plate of N2 words.)

Many Documents 3 w 2 w  1 w “Each document has its own unigram model” 3 w N3 Now does observing docs 1 and 3 help still predict doc 2? Only if  learns that all the ’s are similar (low variance). And in that case, why even have separate ’s? 2 w N2  1 w N1

Many Documents “Each document has its own unigram model” α given, or tuned to maximize training or dev set likelihood θd ~ Dirichlet(α), wdi ~ θd (plate diagram: α → θ → w, inner plate of ND words, outer plate of D documents)

Bayesian Text Categorization “Each document chooses one of only K topics (unigram models)” α given, θk ~ Dirichlet(α) for each topic k, wdi ~ θk (but which k?) (plate diagram: α → θ [plate K] → w, inner plate of ND words, outer plate of D documents)

Bayesian Text Categorization “Each document chooses one of only K topics (unigram models)” τ ~ Dirichlet(·): a distribution over topics 1…K zd ~ τ: a topic in 1…K α given, θk ~ Dirichlet(α), wdi ~ θzd Allows documents to differ considerably while some still share θ parameters. And, we can infer the probability that two documents have the same topic z. Might observe some topics. (plate diagram: τ → z → w ← θ [plate K], inner plate of ND words, outer plate of D documents)
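
A minimal generative sketch of this mixture-of-unigrams story, with made-up sizes and hyperparameters (the variable names loosely follow the notation sketched above): draw a distribution over topics, draw one unigram model per topic, then give each document a single topic and draw all of its words from that topic's model.

import numpy as np

rng = np.random.default_rng(1)

vocab = ["the", "cat", "dog", "goal", "match", "team"]
K, D, N = 2, 3, 8                                     # topics, documents, words per document

tau = rng.dirichlet([1.0] * K)                        # distribution over topics 1..K
theta = rng.dirichlet([0.5] * len(vocab), size=K)     # one unigram model per topic

for d in range(D):
    z_d = rng.choice(K, p=tau)                        # one topic for the whole document
    words = rng.choice(vocab, size=N, p=theta[z_d])   # every word drawn from that topic
    print(f"doc {d} (topic {z_d}):", " ".join(words))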

Latent Dirichlet Allocation (Blei, Ng & Jordan 2003) “Each document chooses a mixture of all K topics; each word gets its own topic” (plate diagram: per-document topic mixture → per-word topic z → word w, with topic plate K, word plate ND, document plate D)
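
A minimal sketch of LDA's generative story, with made-up sizes and hyperparameters: each document draws its own mixture over the K topics, each word position draws its own topic from that mixture, and the word is then drawn from that topic's unigram model.

import numpy as np

rng = np.random.default_rng(2)

vocab = ["the", "cat", "dog", "goal", "match", "team"]
K, D, N = 2, 3, 10                                    # topics, documents, words per document
alpha, beta = 0.5, 0.1                                # made-up Dirichlet hyperparameters

topic_word = rng.dirichlet([beta] * len(vocab), size=K)   # one unigram model per topic
for d in range(D):
    doc_mix = rng.dirichlet([alpha] * K)                  # this document's mixture over topics
    z = rng.choice(K, size=N, p=doc_mix)                  # each word gets its own topic
    words = [rng.choice(vocab, p=topic_word[k]) for k in z]
    print(f"doc {d}:", " ".join(words), " topics:", list(z))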

(Part of) one assignment to LDA’s variables slide thanks to Dave Blei

(Part of) one assignment to LDA’s variables slide thanks to Dave Blei

Latent Dirichlet Allocation: Inference? (plate diagram: topic variables z1, z2, z3, … and words w1, w2, w3, …, with plates K and D)

Finite-State Dirichlet Allocation (Cui & Eisner 2006) “A different HMM for each document” (plate diagram: … z1 → z2 → z3 → …, each zi emitting wi, with plates K and D)

Variants of Latent Dirichlet Allocation Syntactic topic model: A word or its topic is influenced by its syntactic position. Correlated topic model, hierarchical topic model, …: Some topics resemble other topics. Polylingual topic model: All versions of the same document use the same topic mixture, even if they’re in different languages. (Why useful?) Relational topic model: Documents on the same topic are generated separately but tend to link to one another. (Why useful?) Dynamic topic model: We also observe a year for each document. The K topics used in 2011 have evolved slightly from their counterparts in 2010.

Dynamic Topic Model slide thanks to Dave Blei

Dynamic Topic Model slide thanks to Dave Blei

Dynamic Topic Model slide thanks to Dave Blei

Dynamic Topic Model slide thanks to Dave Blei

Remember: Finite-State Dirichlet Allocation (Cui & Eisner 2006) “A different HMM for each document” (plate diagram: … z1 → z2 → z3 → …, each zi emitting wi, with plates K and D)

Bayesian HMM “Shared HMM for all documents” (or just have 1 document) (plate diagram: … z1 → z2 → z3 → …, each zi emitting wi, with plates K and D) We have to estimate the transition parameters and the emission parameters.
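
A minimal generative sketch of such a Bayesian HMM, with made-up sizes and hyperparameters: the transition and emission parameters (whose symbols were lost in this transcript) are shared across all documents and drawn once from Dirichlet priors.

import numpy as np

rng = np.random.default_rng(3)

vocab = ["the", "cat", "dog", "goal", "match", "team"]
K, D, N = 3, 2, 8                                     # states, documents, words per document

start = rng.dirichlet([1.0] * K)                      # initial-state distribution
trans = rng.dirichlet([1.0] * K, size=K)              # shared transition parameters (row per state)
emit  = rng.dirichlet([0.5] * len(vocab), size=K)     # shared emission parameters (row per state)

for d in range(D):
    z = rng.choice(K, p=start)
    words = []
    for _ in range(N):
        words.append(rng.choice(vocab, p=emit[z]))    # w_i ~ emission distribution of state z_i
        z = rng.choice(K, p=trans[z])                 # z_{i+1} ~ transition distribution of z_i
    print(f"doc {d}:", " ".join(words))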

FIN