
HW 4

Nonparametric Bayesian Models

Parametric Model
Fixed number of parameters that is independent of the amount of data we're fitting

Nonparametric Model
Number of free parameters grows with the amount of data
Potentially infinite-dimensional parameter space
Only a finite subset of the parameters is used in a nonparametric model to explain a finite amount of data → model complexity grows with the amount of data

Example: k Nearest Neighbor (kNN) Classifier
[Figure: 2-D scatter of labeled points (x and o) with unlabeled query points (?) to be classified by their nearest neighbors]

Bayesian Nonparametric Models
Model is based on an infinite-dimensional parameter space
But utilizes only a finite subset of the available parameters on any given (finite) data set, i.e., model complexity is finite but unbounded
Typically:
Parameter space consists of functions or measures (a measure is a nonnegative function over sets)
Complexity is limited by marginalizing out over surplus dimensions

Content of most slides borrowed from Zoubin Ghahramani and Michael Jordan
For parametric models, we do inference on random variables θ
For nonparametric models, we do inference on stochastic processes ('infinite-dimensional random variables')

What Will This Buy Us?

Intuition: Mixture Of Gaussians
Standard GMM has a fixed number of components
π: mixing proportions; θ: means and variances
Quiz: What sort of prior would you put on π? On θ?
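For reference, the fixed-K mixture the slide refers to can be written as (reconstructed; the slide's equation image is not in the transcript):
p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k),  with θ_k = (μ_k, Σ_k)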

Intuition: Mixture Of Gaussians
Standard GMM has a fixed number of components
Equivalent form: but suppose that instead of working with π and θ separately, we had a mixing distribution G
G: mixing distribution
δ_{θ_k}: 1 unit of probability mass iff θ_k = θ
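In standard notation, this equivalent form is (reconstructed, not from the transcript): G = Σ_{k=1..K} π_k δ_{θ_k}, then for each observation draw θ̄_i ~ G and x_i ~ N(x | θ̄_i).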

Being Bayesian
Can we define a prior over π? Yes: stick-breaking process
Can we define a prior over the mixing distribution G? Yes: Dirichlet process

Stick Breaking
Imagine breaking a stick by recursively breaking off bits of the remaining stick
Formally, define an infinite sequence of beta RVs, and an infinite sequence of weights based on the {β_i} (see below)
Produces a distribution on a countably infinite space
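The standard stick-breaking construction (reconstructed; the slide's equations are images not captured in the transcript):
β_k ~ Beta(1, α) for k = 1, 2, …
π_k = β_k ∏_{j<k} (1 − β_j), so each π_k is the length of a successive stick piece and Σ_k π_k = 1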

Dirichlet Process
Stick breaking gave us the weights {π_k}
For each k we draw θ_k ~ G_0
And define a new function G from the weights and atoms (see below)
The distribution of G is known as a Dirichlet process: G ~ DP(α, G_0)
Intuitively, an infinite-dimensional Dirichlet distribution
Borrowed from Ghahramani tutorial
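The function defined on the slide, in standard notation: G = Σ_{k=1..∞} π_k δ_{θ_k}, a random discrete distribution whose atoms θ_k come from the base measure G_0 and whose weights π_k come from stick breaking.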

Dirichlet Process
Stick breaking gave us the weights {π_k}
For each k we draw θ_k ~ G_0
And define a new function G
The distribution of G is known as a Dirichlet process: G ~ DP(α, G_0)
QUIZ
For GMM, what is θ_k?
For GMM, what is θ?
For GMM, what is a draw from G?
For GMM, how do we get draws that have fewer mixture components?
For GMM, how do we set G_0?
What happens to G as α → ∞?

Dirichlet Process II
For all finite partitions (A_1, A_2, A_3, …, A_K) of Θ, if G ~ DP(α, G_0), what is G(A_i)? (Recall that G is a function over sets.)
Note: partitions do not have to be exhaustive
Adapted from Ghahramani tutorial
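For reference, the standard defining property (reconstructed): for an exhaustive finite partition, (G(A_1), …, G(A_K)) ~ Dirichlet(α G_0(A_1), …, α G_0(A_K)); marginally, each G(A_i) ~ Beta(α G_0(A_i), α (1 − G_0(A_i))).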

Drawing From A Dirichlet Process
DP is a distribution over discrete distributions: G ~ DP(α, G_0)
Therefore, as you draw more points φ_i ~ G, you are more likely to get repetitions
So you can think about a DP as inducing a partitioning of the points by equality, e.g., φ_1 = φ_3 = φ_4 ≠ φ_2 = φ_5
The Chinese restaurant process (CRP) induces the corresponding distribution over these partitions
CRP: generative model for (1) sampling from the DP, then (2) sampling from G
How does this relate to the GMM?

Chinese Restaurant Process: Informal Description
Borrowed from Jordan lecture

Chinese Restaurant Process: Formal Description
[Figure: customers (•) seated at tables serving meals θ_1, θ_2, θ_3, θ_4; each customer's meal is an instance, each table's meal is a type]
Borrowed from Ghahramani tutorial
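The seating rule the figure illustrates, in its standard form (reconstructed): after n customers are seated, customer n+1 joins occupied table k with probability n_k / (n + α), where n_k is that table's occupancy, and starts a new table with probability α / (n + α).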

Comments On CRP
Rich-get-richer phenomenon: the popular tables are more likely to attract new patrons
CRP produces a sample drawn from G, which in turn is drawn from the DP, without explicitly specifying G (see the sketch below)
Analogous to how we could sample the outcome of a biased coin flip (H, T) without explicitly specifying the coin bias ρ: ρ ~ Beta(α, β), X ~ Bernoulli(ρ)
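A minimal Python sketch (my own illustration, not code from the slides) of sampling table assignments from a CRP; it generates the partition structure of a draw without ever representing G explicitly:

import numpy as np

def sample_crp(n_customers, alpha, rng=None):
    """Sample a seating arrangement from a Chinese restaurant process.

    Returns (assignments, table_counts): assignments[i] is the table index
    of customer i; table_counts shows the rich-get-richer effect.
    """
    rng = np.random.default_rng() if rng is None else rng
    assignments = []
    table_counts = []   # table_counts[k] = number of customers already at table k
    for n in range(n_customers):
        # Existing table k with prob n_k / (n + alpha); new table with prob alpha / (n + alpha).
        probs = np.array(table_counts + [alpha], dtype=float) / (n + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(table_counts):   # opened a new table
            table_counts.append(1)
        else:
            table_counts[table] += 1
        assignments.append(table)
    return assignments, table_counts

# Larger alpha tends to produce more occupied tables.
assignments, counts = sample_crp(n_customers=100, alpha=2.0)
print(len(counts), "tables; occupancies:", counts)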

Infinite Exchangeability of CRP
A sequence of variables X_1, X_2, X_3, …, X_n is exchangeable if the joint distribution is invariant to permutation, with σ any permutation of {1, …, n}
An infinite sequence is infinitely exchangeable if any finite subsequence is exchangeable
Quiz: What is the relationship to i.i.d. (independent, identically distributed)?
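The invariance condition referred to above, written out: P(X_1, …, X_n) = P(X_{σ(1)}, …, X_{σ(n)}) for every permutation σ of {1, …, n}.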

Infinite Exchangeability of CRP
Probability of a configuration is independent of the particular order in which individuals arrived
Convince yourself with a simple example:
[Figure: two arrival orders that leave tables θ_1, θ_2, θ_3 with the same occupancies have the same probability]

De Finetti (1935)
If {X_i} is exchangeable, there is a random θ such that the X_i are i.i.d. conditioned on θ (see below)
If {X_i} is infinitely exchangeable, then θ may be a stochastic process (infinite dimensional)
Thus, there exists a hierarchical Bayesian model for the observations {X_i}
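The representation theorem in its standard form (reconstructed): P(X_1, …, X_n) = ∫ [ ∏_{i=1..n} P(X_i | θ) ] P(dθ).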

Consequence Of Exchangeability
Easy to do Gibbs sampling: exchangeability lets us treat any observation as if it were the last one to arrive
This is collapsed Gibbs sampling, feasible because the DP is a conjugate prior on a multinomial draw

Dirichlet Process: Conjugacy
Borrowed from Ghahramani tutorial
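The conjugacy result this slide presents, in its standard form (equations reconstructed, not from the transcript): if G ~ DP(α, G_0) and θ_1, …, θ_n ~ G, then the posterior is again a Dirichlet process, G | θ_1, …, θ_n ~ DP(α + n, (α G_0 + Σ_{i=1..n} δ_{θ_i}) / (α + n)).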

CRP-Based Gibbs Sampling Demo

Dirichlet Process Mixture of Gaussians
Instead of prespecifying the number of components, draw the parameters of the mixture model from a DP → infinite mixture model

Sampling From A DP Mixture of Gaussians
Borrowed from Ghahramani tutorial
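A minimal sketch (my own illustration, not from the slides) of generating data from a DP mixture of 1-D Gaussians by way of the CRP; the base measure G_0 over component means and the observation noise are assumptions chosen for the example:

import numpy as np

def sample_dp_gmm(n, alpha, rng=None):
    """Generate n points from a DP mixture of 1-D Gaussians via the CRP.

    Assumptions for illustration: base measure G_0 = Normal(0, 2^2) over
    component means; each component has a fixed observation std of 0.5.
    """
    rng = np.random.default_rng() if rng is None else rng
    means, counts, x = [], [], []
    for i in range(n):
        # CRP assignment: existing component k with prob n_k / (i + alpha),
        # brand-new component with prob alpha / (i + alpha).
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(means):                      # new component: draw its mean from G_0
            means.append(rng.normal(0.0, 2.0))
            counts.append(1)
        else:
            counts[k] += 1
        x.append(rng.normal(means[k], 0.5))      # observation from the chosen component
    return np.array(x), means, counts

data, means, counts = sample_dp_gmm(n=200, alpha=1.0)
print(len(means), "components with sizes", counts)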

Parameters Vs. Partitions
Rather than a generative model that spits out mixture component parameters, it could equivalently spit out partitions of the data
Use s_i to denote the partition (cluster) indicator of x_i
Casting the problem in terms of indicators will allow us to use the CRP
Let's first analyze the finite mixture case

Bayesian Mixture Model (Finite Case)
Borrowed from Ghahramani tutorial
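The finite model this slide presents is, in its standard form (reconstructed): π ~ Dirichlet(α/K, …, α/K); s_i | π ~ Categorical(π) for each of the N observations; θ_k ~ G_0 for each of the K components; x_i | s_i, θ ~ p(x | θ_{s_i}).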

Bayesian Mixture Model (Finite Case)
Integrating out the mixing proportions π, we obtain the conditional distribution of one indicator given the others (see below)
Allows for Gibbs sampling over the posterior of the indicators
Rich-get-richer effect: more populous classes are likely to be joined
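The resulting conditional, in standard notation (reconstructed), where n_{-i,k} is the number of observations other than i assigned to class k: P(s_i = k | s_{-i}) = (n_{-i,k} + α/K) / (N − 1 + α).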

From Finite To Infinite Mixtures
Finite case vs. infinite case (see below)
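Taking K → ∞ in the conditional above gives the CRP form (reconstructed): for a class k that already has members, P(s_i = k | s_{-i}) = n_{-i,k} / (N − 1 + α); the total probability of starting a brand-new class is α / (N − 1 + α).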

Don't The Observations Matter?
Yes! The previous slides took a shortcut and ignored the data (x) and parameters (θ)
Gibbs sampling should reassign the indicators {s_i} conditioned on all other variables (see below)
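The full conditional combines the CRP prior term with the likelihood (standard form, reconstructed): P(s_i = k | s_{-i}, x, θ) ∝ P(s_i = k | s_{-i}) · p(x_i | θ_k); for a brand-new class, the likelihood term is p(x_i) with θ integrated out against G_0.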

Partitioning Performed By CRP
You can think about the CRP as creating a binary matrix
Rows are diners; columns are tables
Cells indicate assignment of diners to tables
Columns are mutually exclusive 'classes', e.g., mixture components in a DP mixture model
Infinite number of columns in the matrix

More General Prior On Binary Matrices
Allow each individual to be a member of multiple classes, or to be represented by multiple features ('distributed representation')
E.g., an individual is male, married, Democrat, fan of the CU Buffs, etc.
As with the CRP matrix, a fixed number of rows and an infinite number of columns
But no constraint on the number of columns that can be nonzero in a given row

Finite Binary Feature Matrix
[Figure: N × K binary matrix; rows = individuals, columns = features]
Borrowed from Ghahramani tutorial
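The finite model behind this slide, in its standard form (reconstructed): each of the K features has a probability π_k ~ Beta(α/K, 1), and each matrix entry is drawn independently as z_{nk} | π_k ~ Bernoulli(π_k); letting K → ∞ yields the Indian buffet process on the next slides.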

Binary Matrices In Left-Ordered Form Borrowed from Gharamani tutorial

Indian Buffet Process
[Figure annotation: number of diners who already chose dish k]
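A minimal Python sketch (my own illustration) of sampling a binary feature matrix from the IBP: customer i takes each previously sampled dish k with probability m_k / i, where m_k is the number of earlier diners who chose dish k, and then samples Poisson(α / i) brand-new dishes:

import numpy as np

def sample_ibp(n_customers, alpha, rng=None):
    """Sample a binary feature matrix Z from the Indian buffet process.

    Z[i, k] = 1 if customer (row) i took dish (feature/column) k.
    """
    rng = np.random.default_rng() if rng is None else rng
    dish_counts = []                 # m_k for every dish sampled so far
    rows = []
    for i in range(1, n_customers + 1):
        # Existing dishes: take dish k with probability m_k / i.
        row = [int(rng.random() < m / i) for m in dish_counts]
        # Brand-new dishes: Poisson(alpha / i) of them, all taken by this customer.
        n_new = rng.poisson(alpha / i)
        row.extend([1] * n_new)
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * n_new
        rows.append(row)
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row        # later columns stay 0 for earlier customers
    return Z

Z = sample_ibp(n_customers=10, alpha=2.0)
print(Z.shape, Z.sum(axis=0))        # matrix size and number of diners per dish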

IBP Example (Griffiths & Ghahramani, 2006)

Ghahramani’s Model Space

Hierarchical Dirichlet Process (HDP)
Suppose you want to model where people hang out in a town
Not known in advance how many locations need to be modeled
Some spots in town are generally popular, others not so much, but individuals also have preferences that deviate from the population preference
E.g., bars are popular, but not for individuals who don't drink
Need to model the distribution over locations at the level of both the population and the individual

Hierarchical Dirichlet Process
Population distribution
Individual distribution
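In standard notation (reconstructed; the slide's equations are not in the transcript): the population distribution is itself a draw from a DP, G_0 ~ DP(γ, H), and each individual j's distribution is a DP draw centered on the population's, G_j ~ DP(α_0, G_0). Sharing the discrete G_0 is what lets individuals reuse the population's atoms (locations) with their own weights.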

Other Stick-Breaking Processes
Borrowed from Ghahramani tutorial