Generative (Bayesian) modeling 04/04/2016

Slides by (credit to): David M. Blei, Andrew Y. Ng, Michael I. Jordan, Ido Abramovich, L. Fei-Fei, P. Perona, J. Sivic, B. Russell, A. Efros, A. Zisserman, B. Freeman, Tomasz Malisiewicz, Thomas Huffman, Tom Landauer and Peter Foltz, Melanie Martin, Hsuan-Sheng Chiu, Haiyan Qiao, Jonathan Huang. Thank you!

Generative modeling: unsupervised learning … beyond clustering. How can we describe/model the world for the computer? With Bayesian networks!

Bayesian networks: Directed acyclic graphs (DAGs) whose nodes represent random variables. Arcs represent (directed) dependence between random variables.

Bayesian networks: Filled nodes: observable variables. Empty nodes: hidden (not observable) variables. [Figure: hidden node z_i with observed child nodes w_i1, w_i2, w_i3, w_i4]

Collapsed (plate) notation of Bayesian networks: Frames (plates) indicate that the enclosed nodes are replicated, e.g. N features and M instances. [Figure: the network of z_i and w_i1 ... w_i4 drawn in collapsed form]
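As a hedged illustration of what the collapsed notation abbreviates: a Bayesian network factorises the joint distribution into one term per node, conditioned on that node's parents. Writing this out for the small network on the slide (one hidden node z_i with N observed children, repeated for M instances) gives the second expression. The index names i and j are assumptions chosen here for readability.

```latex
P(X_1,\dots,X_n) \;=\; \prod_{k=1}^{n} P\bigl(X_k \mid \mathrm{parents}(X_k)\bigr)
\qquad\Longrightarrow\qquad
P(z, w) \;=\; \prod_{i=1}^{M} \Bigl[\, P(z_i) \prod_{j=1}^{N} P(w_{ij} \mid z_i) \Bigr]
```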

Generative (Bayesian) modeling: Find the parameters of the given model that best explain/reconstruct the observed data. [Diagram: Model → "generative story" → DATA]

Model = Bayesian network ("generative story"): The structure of the network is given by the human engineer. The form of each node's distribution (conditioned on its parents) is given as well. The parameters of the distributions have to be estimated from data.

Parameter estimation in Bayesian networks (only observable variables): A Bayesian network assumes that each variable depends (directly) only on its parents → parameter estimation at each node can be carried out separately, by Maximum Likelihood (or Bayesian estimation).
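A minimal sketch of per-node Maximum Likelihood estimation when every variable is observed, assuming a two-node network Z → W with discrete values. The function name and the toy (topic, word) pairs are hypothetical; the point is only that each node's table is obtained by counting, independently of the other nodes.

```python
from collections import Counter, defaultdict


def fit_fully_observed(pairs):
    """ML estimates of P(z) and P(w | z) from fully observed (z, w) pairs."""
    z_counts = Counter(z for z, _ in pairs)
    zw_counts = defaultdict(Counter)
    for z, w in pairs:
        zw_counts[z][w] += 1

    n = len(pairs)
    # Node Z has no parents: relative frequency of each value.
    p_z = {z: c / n for z, c in z_counts.items()}
    # Node W has parent Z: relative frequency of w within each value of z.
    p_w_given_z = {
        z: {w: c / z_counts[z] for w, c in words.items()}
        for z, words in zw_counts.items()
    }
    return p_z, p_w_given_z


# Hypothetical toy data: both the topic and the word are observed.
data = [("finance", "money"), ("finance", "bank"), ("finance", "bank"),
        ("nature", "river"), ("nature", "bank")]
p_z, p_w_given_z = fit_fully_observed(data)
```

Once Z is hidden, these counts are no longer available, which is what motivates the EM algorithm.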

Expectation-Maximisation (EM): The extension of Maximum Likelihood parameter estimation to the case where hidden variables are present. We search for the parameter vector Φ which maximises the likelihood of the observed variables X, working through the joint distribution of X and the hidden variables Z.

Expectation-Maximisation (EM): Iterative algorithm. In step l: (E)xpectation step: estimate the values of Z (calculate their expected values) using Φ^(l). (M)aximisation step: Maximum Likelihood estimation of the new parameters Φ^(l+1) using the estimated Z.
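For reference, the two steps are usually written as follows. This is the standard textbook formulation rather than something taken from the slides, and the symbol Q for the expected complete-data log-likelihood is an assumption introduced here.

```latex
\begin{align*}
\text{E-step:}\quad & Q(\Phi \mid \Phi^{(l)}) \;=\; \mathbb{E}_{Z \sim P(Z \mid X,\, \Phi^{(l)})}\bigl[\log P(X, Z \mid \Phi)\bigr] \\
\text{M-step:}\quad & \Phi^{(l+1)} \;=\; \arg\max_{\Phi}\; Q(\Phi \mid \Phi^{(l)})
\end{align*}
```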

EM example: There are two coins. We toss them together, but we can observe only the total number of heads. h(k) is the number of tosses that showed k heads: h(0)=4, h(1)=9, h(2)=2. What is the bias of each coin? Φ1=P1(H), Φ2=P2(H)?

EM example: A single hidden variable z: in what proportion of the h(1)=9 single-head tosses was it the first coin that showed heads? Initialisation: Φ1^(0)=0.2, Φ2^(0)=0.5. E-step

EM example M-step
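A minimal sketch of the two-coin example, assuming the coins are tossed independently, so that a single-head toss is explained by "coin 1 showed the head" versus "coin 2 showed the head" in the ratio Φ1(1-Φ2) : (1-Φ1)Φ2. The function name and the update formulas are reconstructed under that assumption and are not copied from the slides.

```python
def em_two_coins(h0, h1, h2, phi1, phi2, iterations=1):
    """EM for two coins when only the number of heads per toss is observed.

    h0, h1, h2: number of tosses that showed 0, 1 and 2 heads.
    phi1, phi2: initial guesses for P(heads) of coin 1 and coin 2.
    """
    n = h0 + h1 + h2
    for _ in range(iterations):
        # E-step: expected number of single-head tosses where coin 1 was the head.
        coin1_head = phi1 * (1 - phi2)
        coin2_head = (1 - phi1) * phi2
        z = h1 * coin1_head / (coin1_head + coin2_head)

        # M-step: expected heads of each coin divided by the number of tosses.
        phi1 = (h2 + z) / n
        phi2 = (h2 + (h1 - z)) / n
    return phi1, phi2


phi1, phi2 = em_two_coins(4, 9, 2, phi1=0.2, phi2=0.5, iterations=1)
```

With the initial values from the slide, the E-step gives z = 9 · 0.1 / 0.5 = 1.8, so one iteration yields Φ1 = 3.8/15 ≈ 0.253 and Φ2 = 9.2/15 ≈ 0.613.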

Text classification/clustering: E.g. recognising the topic of documents, or clustering images based on their content. "Bag-of-words model". The term-document matrix:

Image: bag of "words"

N documents: D={d1, …, dN}. The dictionary consists of M words: W={w1, …, wM}. The size of the term-document matrix is N × M, and it contains the number of occurrences of each word in each document.
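A small sketch of how such a term-document matrix can be built, assuming whitespace tokenisation and an alphabetically sorted dictionary (both are simplifications chosen here; the function name and the two example documents are hypothetical).

```python
def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    index = {word: j for j, word in enumerate(vocabulary)}

    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(documents):
        for word in doc.split():
            matrix[i][index[word]] += 1
    return vocabulary, matrix


docs = ["money bank loan bank", "river stream bank river"]
vocab, counts = term_document_matrix(docs)
# vocab  -> ['bank', 'loan', 'money', 'river', 'stream']
# counts -> [[2, 1, 1, 0, 0], [1, 0, 0, 2, 1]]
```

Here N = 2 and M = 5, so the matrix has 2 rows and 5 columns. Word order inside each document is lost.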

Drawbacks of the bag-of-words model: Word order is ignored. Synonymy: we refer to a concept (object) by multiple words, e.g. tired/sleepy → low recall. Polysemy: most words have multiple senses, e.g. bank, chips → low precision.

Document clustering – unigram model: Let's assign a "topic" to each document. The topics are hidden variables.

Generative story of the "unigram model": How are documents generated? Draw a topic (and a document length). For each word position, draw a word according to the topic's word distribution. [Diagram: TOPIC → word ... word]

Unigram model: For each of the M documents, draw a topic z, then draw each word (independently of the others) from a multinomial distribution conditioned on z. [Figure: hidden node z_i with observed child nodes w_i1, w_i2, w_i3, w_i4] Notes on this slide: It is not clear what the plates are, what N and M are, and how this relates to a naïve Bayes model. Between this slide and the previous one, add a slide introducing the naïve Bayes model, and a slide explaining why naïve Bayes is not good (a single class is picked for each document). Say more precisely what the plate means, i.e. parameter sharing in the CPTs.

EM for clustering E-step M-step
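A hedged sketch of EM for this kind of document clustering (a mixture of unigrams), written in plain Python. The E-step computes, for every document, the posterior over topics given its word counts; the M-step re-estimates the topic prior and the per-topic word distributions from those soft assignments. The function name, the random initialisation and the small `smoothing` constant (added to avoid zero probabilities) are assumptions made for this illustration.

```python
import math
import random


def em_unigram_mixture(counts, k, iterations=50, seed=0, smoothing=0.01):
    """Fit a mixture of unigrams to a term-document count matrix with EM."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    p_z = [1.0 / k] * k
    p_w_z = []
    for _ in range(k):
        row = [rng.random() + 0.1 for _ in range(n_words)]
        total = sum(row)
        p_w_z.append([v / total for v in row])

    resp = []
    for _ in range(iterations):
        # E-step: posterior P(z | document), computed in log space for stability.
        resp = []
        for doc in counts:
            log_post = [
                math.log(p_z[z])
                + sum(c * math.log(p_w_z[z][j]) for j, c in enumerate(doc) if c)
                for z in range(k)
            ]
            top = max(log_post)
            unnorm = [math.exp(v - top) for v in log_post]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])

        # M-step: smoothed re-estimation from the soft assignments.
        p_z = [(sum(r[z] for r in resp) + smoothing) / (n_docs + k * smoothing)
               for z in range(k)]
        for z in range(k):
            row = [smoothing + sum(resp[d][z] * counts[d][j] for d in range(n_docs))
                   for j in range(n_words)]
            total = sum(row)
            p_w_z[z] = [v / total for v in row]
    return p_z, p_w_z, resp


counts = [[3, 2, 1, 0, 0], [2, 3, 1, 0, 0], [0, 0, 1, 3, 2], [0, 0, 1, 2, 3]]
p_z, p_w_z, resp = em_unigram_mixture(counts, k=2)
```

Each row of `resp` is a soft cluster assignment for one document. Because every document is generated by a single topic, these rows tend to end up close to 0 or 1.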

pLSA (Probabilistic Latent Semantic Analysis): We assign a distribution over topics to each document. Topics still have a distribution over words. The distributions found are interpretable.

Relation to clustering: A document can belong to multiple clusters. We are interested in the distribution of topics rather than forcing each document into a single cluster → more flexible.

Generative story of pLSA: How are documents generated? Generate the document's topic distribution. For each word position: draw a topic from the document's topic distribution, then draw a word according to that topic's word distribution. [Diagram: TOPIC distribution → TOPIC ... TOPIC → word ... word]

Example: [Figure] TOPIC 1 generates the words money, loan, bank; TOPIC 2 generates the words river, stream, bank. DOCUMENT 1 draws its words from Topic 1 with weight .8 and from Topic 2 with weight .2: money1 bank1 bank1 loan1 river2 stream2 bank1 money1 river2 bank1 money1 bank1 loan1 money1 stream2 ... DOCUMENT 2 uses the weights .3 and .7: river2 stream2 bank2 stream2 bank2 money1 loan1 river2 stream2 loan1 bank2 river2 bank2 bank1 stream2 ... The number after each word marks the topic that generated it.
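The generative story can be sketched in a few lines of Python, reusing the two topics and the mixture weights (.8/.2 and .3/.7) that appear in the example figure. The uniform word probabilities inside each topic, the document length of 16 and the function name are assumptions made for this illustration.

```python
import random

# Assumed word distributions: the figure shows which words belong to each
# topic, but the exact probabilities here are chosen for illustration only.
TOPIC_WORDS = [
    {"money": 1 / 3, "loan": 1 / 3, "bank": 1 / 3},    # topic 1
    {"river": 1 / 3, "stream": 1 / 3, "bank": 1 / 3},  # topic 2
]


def generate_document(topic_mix, length, rng):
    """Draw (word, topic) pairs: first a topic per position, then a word."""
    tokens = []
    for _ in range(length):
        z = rng.choices(range(len(topic_mix)), weights=topic_mix)[0]
        words = list(TOPIC_WORDS[z])
        probs = [TOPIC_WORDS[z][w] for w in words]
        word = rng.choices(words, weights=probs)[0]
        tokens.append((word, z + 1))  # topics are numbered from 1, as in the figure
    return tokens


rng = random.Random(0)
doc1 = generate_document([0.8, 0.2], 16, rng)
doc2 = generate_document([0.3, 0.7], 16, rng)
```

Note that "bank" can be produced by either topic, which is how the model accommodates polysemy.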

Parameter estimation (model fitting, training): [Figure: the two example documents again, with every unknown replaced by "?"] Only the words of DOCUMENT 1 and DOCUMENT 2 are observed (money? bank? bank? loan? river? stream? ...). The topic of each word, the word distributions of TOPIC 1 and TOPIC 2, and the topic weights of each document are all unknown and have to be estimated.

pLSA: [Figure labels: observable data; term distributions over topics; topic distributions for documents] For that we will use a method called probabilistic latent semantic analysis. pLSA can be thought of as a matrix decomposition. Here is our term-document matrix (documents, words), and we want to find topic vectors common to all documents and mixture coefficients P(z|d) specific to each document. Note that these are all probabilities which sum to one, so a column here is expressed as a convex combination of the topic vectors. What we would like to see is that the topics correspond to objects, so that an image will be expressed as a mixture of different objects and backgrounds. Slide credit: Josef Sivic

Latent semantic analysis (LSA)

Generative story of pLSA: How are documents generated? Generate the document's topic distribution. For each word position: draw a topic from the document's topic distribution, then draw a word according to that topic's word distribution. [Diagram: TOPIC distribution → TOPIC ... TOPIC → word ... word]

pLSA model: For each document d and each word position: draw a topic from the document's topic distribution, then draw a word according to that topic's distribution. [Figure: document node d, hidden topic nodes z_d1 ... z_d4, observed word nodes w_d1 ... w_d4] Notes on this slide: Between this slide and the previous one, add a motivating example with a latent variable in naïve Bayes. Show an unrolled Bayesian network for this model using the running example (text classification).

pLSA for images (example): [Figure: plate diagram with nodes d, z, w and plates N and D; example visual word "eye"] Sivic et al., ICCV 2005

pLSA – parameter estimation

pLSA – E-step: What is the expected value of the hidden variables (topics z) if the parameter values are fixed?

pLSA – M-step: We use the values of the hidden variables, p(z|d,w), computed in the E-step.
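For reference, the usual form of the pLSA updates is given below. Here n(d,w) denotes the entry of the term-document matrix (the count of word w in document d); that symbol is an assumption introduced for this note, and the formulas are the standard ones rather than a copy of the slides.

```latex
\begin{align*}
\text{E-step:}\quad & P(z \mid d, w) \;=\; \frac{P(z \mid d)\, P(w \mid z)}{\sum_{z'} P(z' \mid d)\, P(w \mid z')} \\[4pt]
\text{M-step:}\quad & P(w \mid z) \;=\; \frac{\sum_{d} n(d,w)\, P(z \mid d, w)}{\sum_{d} \sum_{w'} n(d,w')\, P(z \mid d, w')} \\[4pt]
                    & P(z \mid d) \;=\; \frac{\sum_{w} n(d,w)\, P(z \mid d, w)}{\sum_{w} n(d,w)}
\end{align*}
```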

EM algorithm: It can converge to a local optimum. Stopping condition?
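A minimal pLSA sketch in plain Python that also shows one possible stopping condition: stop when the log-likelihood improves by less than a tolerance `tol`, or after a maximum number of iterations. The function names, the random initialisation and the default tolerance are assumptions made for this illustration. Since EM may end in a local optimum, running it from several seeds and keeping the best log-likelihood is a common precaution.

```python
import math
import random


def normalize(values):
    """Scale non-negative values to sum to one (uniform if they are all zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(counts, k, max_iterations=200, tol=1e-6, seed=0):
    """Fit pLSA to a term-document count matrix; returns P(z|d), P(w|z), log-likelihoods."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(k)]) for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)]) for _ in range(k)]

    log_likelihoods = []
    for _ in range(max_iterations):
        new_w_z = [[0.0] * n_words for _ in range(k)]
        new_z_d = [[0.0] * k for _ in range(n_docs)]
        log_likelihood = 0.0
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | d, w) under the current parameters.
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(k)]
                p_w_d = sum(joint)
                log_likelihood += n * math.log(p_w_d)
                for z in range(k):
                    expected = n * joint[z] / p_w_d
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected

        log_likelihoods.append(log_likelihood)
        # Stopping condition: negligible improvement of the log-likelihood.
        if len(log_likelihoods) > 1 and log_likelihood - log_likelihoods[-2] < tol:
            break

        # M-step: normalise the expected counts.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_z_d, p_w_z, log_likelihoods


counts = [[3, 2, 1, 0, 0], [2, 3, 1, 0, 0], [0, 0, 1, 3, 2], [0, 0, 1, 2, 3]]
p_z_d, p_w_z, log_likelihoods = plsa(counts, k=2)
```

The recorded log-likelihoods should never decrease from one iteration to the next, which makes them a convenient sanity check as well as a stopping criterion.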

Approximate inference: The E-step in huge networks is infeasible. There are approaches which are fast but perform only approximate inference (= E-step) rather than exact inference. The most popular approximate inference method is sampling: draw samples according to the Bayesian network; the average of the samples can be used as the expected values of the hidden variables.

Markov Chain Monte Carlo (MCMC) method: MCMC is an approximate inference method. The samples are not independent of each other; they are generated one by one, each based on the previous sample (they form a chain). Gibbs sampling is the most famous MCMC method: the next sample is generated by fixing all variables but one and drawing a value for the non-fixed one, conditioned on the others.
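A toy sketch of Gibbs sampling, assuming just two binary variables x and y whose joint distribution is given as a 2×2 table. The table values, the burn-in length and the function name are hypothetical. Each step fixes one variable, draws the other from its conditional distribution, and the average over the chain approximates an expected value.

```python
import random


def gibbs_two_binary(joint, n_samples, burn_in=1000, seed=0):
    """Estimate E[x] for joint[x][y] = P(x, y) by Gibbs sampling."""
    rng = random.Random(seed)
    x, y = 0, 0
    total_x = 0
    for step in range(burn_in + n_samples):
        # Fix y, draw x from P(x | y).
        p_x1 = joint[1][y] / (joint[0][y] + joint[1][y])
        x = 1 if rng.random() < p_x1 else 0
        # Fix x, draw y from P(y | x).
        p_y1 = joint[x][1] / (joint[x][0] + joint[x][1])
        y = 1 if rng.random() < p_y1 else 0
        # Discard the first samples, which still depend on the starting point.
        if step >= burn_in:
            total_x += x
    return total_x / n_samples


joint = [[0.3, 0.2],   # P(x=0, y=0), P(x=0, y=1)
         [0.1, 0.4]]   # P(x=1, y=0), P(x=1, y=1)
estimate = gibbs_two_binary(joint, n_samples=20000)
```

The exact value here is P(x=1) = 0.1 + 0.4 = 0.5, so the estimate should land close to 0.5. In a real model the exact value is unavailable, which is the reason for sampling.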

Outlook

Drawbacks of pLSA: It has to be recalculated from scratch if a new document arrives. The number of parameters increases as a function of the number of instances. d is just an index; it doesn't fit well into the generative story.

[Timeline figure; year labels: 1990, 1999, 2003, 2010]

An even more complex task: Recognise objects inside images without any supervision. After training, the model can be applied to previously unseen images as well.

Summary: Generative (Bayesian) modeling makes it possible to define a description/model of the world of any complexity (clustering is only one example of it). The EM algorithm is a general tool for solving parameter estimation problems in which latent variables are involved.