Non-Parametric Models


Non-Parametric Models, CSLT ML Summer Seminar (9), Dong Wang

Content Introduction Gaussian process Dirichlet process

Parametric models Usually we hope for a model M(X) that is represented by a set of parameters, in any form (discriminative, descriptive, ...). Model parameters are optimized with any reasonable criterion, by any reasonable optimization approach. The parameterized model is then used for any inference task, including prediction, classification, generation, or others. It is simple, isn’t it?

But things are not so straightforward Parameterization means ‘knowledge’ about the model. Sometimes we don’t have such knowledge, and then it is better to work in a non-parametric way, e.g., density estimation, nearest neighbour, SVM.

Non-parametric models We say that a learning algorithm is non-parametric if the complexity of the functions it can learn is allowed to grow as the amount of training data is increased. Yoshua Bengio, 18 Aug 2008

Two examples Bayesian linear regression: we impose a Gaussian prior on the weights, which leads to a random prediction function. Can we forget the parameters and put a prior directly on all possible prediction functions? Clustering: in k-NN and GMM, we need to pre-define k or the number of mixtures. What if there is so much data that more mixtures are required?

Introduction Gaussian process Dirichlet process

Back to linear regression

Bayesian linear regression A Gaussian prior on W A Gaussian posterior on W A Gaussian on prediction t
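As a concrete illustration (a minimal sketch, not the slide's own code; the prior precision alpha and the noise precision beta are assumed hyperparameters), the standard conjugate-Gaussian results for this chain of Gaussians can be written as:

```python
import numpy as np

def bayesian_linear_regression(Phi, t, alpha=1.0, beta=25.0):
    """Posterior over weights W for t = Phi @ W + noise, with prior W ~ N(0, I/alpha)
    and noise precision beta (standard conjugate-Gaussian results)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # posterior precision
    S_N = np.linalg.inv(S_N_inv)                       # posterior covariance of W
    m_N = beta * S_N @ Phi.T @ t                       # posterior mean of W
    return m_N, S_N

def predict(phi_star, m_N, S_N, beta=25.0):
    """Gaussian predictive distribution of t at a new feature vector phi_star."""
    mean = phi_star @ m_N
    var = 1.0 / beta + phi_star @ S_N @ phi_star
    return mean, var
```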

A function view Treat y = ΦW as a random function. The Gaussian prior on W leads to a Gaussian distribution on the training outputs Y = [y(x1), y(x2), ..., y(xN)]. We in fact place a distribution on the function y. A distribution on functions gives a random function, and is therefore called a ‘random process’. Specifying randomness on functions directly is difficult, so it is described on finite sets of samples; here, the distribution is Gaussian. A stochastic process y(x) is specified by giving the joint probability distribution for any finite set of values y(x1), ..., y(xN) in a consistent manner. The correlation between two values is related to how far apart the two x’s are.

Gaussian process For a random function y(x), if any finite set of its sample values follows a joint Gaussian distribution with kernel K, then y(x) is said to be a Gaussian process. Usually we use zero as the mean of the joint Gaussian, so the kernel k(x, x') determines the entire behaviour. Note that the Gram matrix K is built from all the training data; as we will see, the prediction depends on all the training data directly. So this model (prior) is non-parametric!

Samples from the Gaussian prior We can sample a function by sampling its values at a limited number of locations (x1, x2, ..., xN). The sampling can also be done one by one: when sampling the value at xi, marginalize over the locations that have not been sampled yet and condition on the values that have already been sampled.
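A minimal sketch of the joint-sampling version (the RBF kernel, grid, and hyperparameters are illustrative assumptions, not from the slide):

```python
import numpy as np

def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel for 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

x = np.linspace(-5, 5, 100)                       # finite grid of input locations
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))      # jitter for numerical stability
samples = np.random.multivariate_normal(np.zeros(len(x)), K, size=5)
# each row of `samples` is one random function drawn from the GP prior
```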

Possible kernels https://en.wikipedia.org/wiki/Gaussian_process

GP for regression GP prior with any K!

Prediction $p(t_{N+1} \mid t_{1:N}) = \mathcal{N}\big(m(x_{N+1}),\, \sigma^2(x_{N+1})\big)$
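The mean and variance come from the usual Gaussian-conditioning result, $m(x_{N+1}) = \mathbf{k}^\top C_N^{-1}\mathbf{t}$ and $\sigma^2(x_{N+1}) = c - \mathbf{k}^\top C_N^{-1}\mathbf{k}$ with $C_N = K + \beta^{-1}I$. A minimal sketch (assuming a known noise variance and reusing the rbf_kernel defined above):

```python
import numpy as np

def gp_predict(X_train, t_train, X_test, kernel, sigma2_n=0.1):
    """Predictive mean and variance of a zero-mean GP regressor with noise variance sigma2_n."""
    C_N = kernel(X_train, X_train) + sigma2_n * np.eye(len(X_train))
    k = kernel(X_test, X_train)                       # k(x*, x_n), one row per test point
    c = kernel(X_test, X_test) + sigma2_n * np.eye(len(X_test))
    C_inv = np.linalg.inv(C_N)
    mean = k @ C_inv @ t_train                        # m(x_{N+1})
    var = np.diag(c - k @ C_inv @ k.T)                # sigma^2(x_{N+1})
    return mean, var
```

Usage would look like `gp_predict(x_train, t_train, x_test, rbf_kernel)`; the prediction indeed uses all training points directly, which is why the model is non-parametric.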

Example with two samples

Prediction confidence

Gaussian process for classification In classification, the output y = σ(wx) is bounded to (0, 1), so it is not suitable to model as a Gaussian. We instead place the prior on the latent function a = wx.

Prediction with Gaussian approximation Use a variational approach to solve the problem, or use the Laplace approximation.
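A rough sketch of the Laplace route (hypothetical helper functions, not the slide's code): find the mode of p(a | t) by Newton iteration, then predict from the mode. Returning σ(E[a*]) ignores the variance correction used in the full Laplace predictive, so this is only an approximation of the approximation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_mode(K, t, n_iter=20):
    """Newton iteration for the mode of p(a | t), with GP kernel matrix K over the
    training inputs and Bernoulli likelihood with targets t in {0, 1}."""
    N = len(t)
    a = np.zeros(N)
    for _ in range(n_iter):
        s = sigmoid(a)
        W = np.diag(s * (1.0 - s))                       # Hessian of the negative log-likelihood
        a = K @ np.linalg.solve(np.eye(N) + W @ K, W @ a + (t - s))
    return a

def gp_classify(K, t, K_star):
    """K_star holds k(x*, x_n) for each test point; returns approximate p(class = 1)."""
    a_hat = laplace_mode(K, t)
    mean_a_star = K_star @ (t - sigmoid(a_hat))          # E[a*] evaluated at the mode
    return sigmoid(mean_a_star)
```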

Introduction Gaussian process Dirichlet process

Back to the Gaussian mixture model There are two sources of randomness: the indicator z, and the parameters μ and Σ. [Plate diagram: π → z_n → x_n ← (μ, Σ), plate over n = 1..N]

What if we don’t know K? The number of clusters must be pre-defined; this is empirical and requires prior knowledge. From the Bayesian view ∫P(x|Θ)P(Θ)dΘ, there are too many components to consider. How about selecting only some of them? Let the data decide how many components there are, and where they lie. [Same plate diagram as above.]

A change of perspective on data generation Before: sample K sets of parameters; for each data point, sample an indicator z, and then use the corresponding parameters to generate the data. Now: for each data point, sample a parameter directly, equal to a previously sampled value with some probability and to a new value with some probability (double randomness); then sample the data point using that parameter.

Sequential random sampling Two random spaces: the discrete set of values that have already been sampled, and the underlying (continuous) space of unknown new values.

Chinese restaurant process A Chinese restaurant with an infinite number of tables, each of which can seat an infinite number of customers. Each customer selects a table, either one already occupied by others, or a new one.
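A minimal sketch of the seating rule (the function name and defaults are illustrative): customer i joins an occupied table with probability proportional to its current size, or opens a new table with probability proportional to the concentration parameter alpha.

```python
import numpy as np

def chinese_restaurant_process(n_customers, alpha=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    table_counts = []                     # number of customers at each occupied table
    assignments = []
    for i in range(n_customers):
        probs = np.array(table_counts + [alpha], dtype=float)
        probs /= probs.sum()              # normalizer is i + alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(table_counts):
            table_counts.append(1)        # open a new table
        else:
            table_counts[k] += 1
        assignments.append(k)
    return assignments, table_counts
```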

It’s not really infinite It defines a probability on the number of tables!
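A standard fact (not on the slide) that makes this concrete: after seating $N$ customers, the expected number of occupied tables is
$$\mathbb{E}[K_N] \;=\; \sum_{i=1}^{N} \frac{\alpha}{\alpha + i - 1} \;\approx\; \alpha \log\!\Big(1 + \frac{N}{\alpha}\Big),$$
so the number of clusters grows with the data, but only logarithmically.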

Now think about parameter generation If we think of the process as generating parameters instead of generating data: a stick-breaking view, with double randomness.

Dirichlet process If we have a distribution H on the parameter θ, we treat it as a function. We want this function to be random, i.e., we let it be a process. The process, again, should be defined through finite samples. The Dirichlet process defines the probability of any partition of the θ space to be a Dirichlet distribution, as follows, where G is a sampled distribution (like H). Many works demonstrate that such a process exists, but the most interesting construction is the stick-breaking one.
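The defining property referred to by "as follows" is standard and can be written out as
$$\big(G(A_1), \ldots, G(A_K)\big) \;\sim\; \mathrm{Dir}\big(\alpha H(A_1), \ldots, \alpha H(A_K)\big)$$
for every finite measurable partition $A_1, \ldots, A_K$ of the parameter space, written $G \sim \mathrm{DP}(\alpha, H)$.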

Another definition
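The slide's formula is not reproduced here; for reference, the stick-breaking construction (Sethuraman), which the next slide connects to the first definition, is standard:
$$\beta_k \sim \mathrm{Beta}(1, \alpha), \qquad \pi_k = \beta_k \prod_{j=1}^{k-1}(1-\beta_j), \qquad \theta_k \sim H, \qquad G = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}.$$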

The equivalence of the two definitions A constructive definition of Dirichlet Priors, Jayaram Sethuraman. http://www3.stat.sinica.edu.tw/statistica/oldpdf/A4n216.pdf Stick-breaking constructed distribution

Contribution of the concentration parameter
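A minimal truncated sketch (the truncation level and printed threshold are arbitrary choices, not from the slide) of how α controls the weights: small α concentrates mass on a few atoms, large α spreads it over many.

```python
import numpy as np

def stick_breaking(alpha, n_atoms=100, rng=None):
    """Truncated stick-breaking weights pi_k of G ~ DP(alpha, H)."""
    rng = np.random.default_rng() if rng is None else rng
    betas = rng.beta(1.0, alpha, size=n_atoms)                    # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining                                      # pi_k = beta_k * prod_{j<k}(1 - beta_j)

for alpha in (0.5, 5.0, 50.0):
    weights = stick_breaking(alpha)
    print(alpha, (weights > 0.01).sum())    # rough count of atoms carrying noticeable weight
```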

Posterior probability Given some samples of θ, what should G look like? It is another DP! And the prediction for the next sample drawn from G is:
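The standard formulas behind these two statements (the slide's own equations are not reproduced here) are
$$G \mid \theta_1,\ldots,\theta_n \;\sim\; \mathrm{DP}\!\Big(\alpha+n,\; \frac{\alpha H + \sum_{i=1}^{n}\delta_{\theta_i}}{\alpha+n}\Big),$$
$$\theta_{n+1} \mid \theta_1,\ldots,\theta_n \;\sim\; \frac{1}{\alpha+n}\Big(\alpha H + \sum_{i=1}^{n}\delta_{\theta_i}\Big),$$
i.e., the next sample repeats a previous value with probability proportional to its count, or comes fresh from H with probability proportional to α.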

DP mixture Sample a ‘discrete’ distribution G over θ. For each data point, sample a parameter Φ from G, and then sample the data point from the parametric model defined by Φ. By observing some Φ, we learn something about G; by observing some X, we obtain the posterior probability over G.

Gibbs sampling Sample the assignments and parameters to explore the posterior over partitions and parameters.
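In the collapsed, CRP-based scheme (e.g., Neal's Algorithm 3 for conjugate base measures), each assignment $z_i$ is resampled from a conditional of the form
$$p(z_i = k \mid z_{-i}, x) \;\propto\;
\begin{cases}
n_{-i,k}\; p\big(x_i \mid \{x_j : z_j = k,\ j \neq i\}\big) & \text{for an existing cluster } k,\\[4pt]
\alpha \displaystyle\int p(x_i \mid \theta)\, dH(\theta) & \text{for a new cluster,}
\end{cases}$$
so the number of clusters can grow or shrink as the sampler runs.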

Some applications DP topic models, with no limit on the number of topics; infinite-state HMMs; speaker diarization and scene analysis, with no limit on the number of speakers; theme detection; image segmentation; ...

Wrap up We know that some models, or parts of models, cannot be captured by a fixed set of parameters; this is where non-parametric methods are useful. A Gaussian process is a random function: the randomness on functions is defined as randomness on variables, which follow a multivariate Gaussian. A Dirichlet process is another random function; it is a random distribution, with the randomness defined on any partition. Stick-breaking and the CRP are both good metaphors for the process. Non-parametric Bayesian methods are a nice extension of the Bayesian framework, connecting random variables to random functions.