Non-Parametric Models


1 Non-Parametric Models
CSLT ML Summer Seminar (9) Non-Parametric Models Dong Wang

2 Content Introduction Gaussian process Dirichlet process

3 Parametric models Usually we hope for a model M(X) represented by a set of parameters, in any form (discriminative, descriptive, ...). The parameters are optimized with any reasonable criterion, by any reasonable optimization approach. The parameterized model is then used for any inference task, including prediction, classification, generation, or others. It is simple, isn't it?

4 But things are not so straightforward
Parameterization means 'knowledge' about the model. Sometimes we don't have such knowledge, and then it is better to work in a non-parametric way: density estimation, nearest neighbour, SVM.

5 Non-parametric models
We say that a learning algorithm is non-parametric if the complexity of the functions it can learn is allowed to grow as the amount of training data is increased. — Yoshua Bengio, 18 Aug 2008

6 Two examples Bayesian linear regression Clustering
Bayesian linear regression: we impose a Gaussian prior, which leads to a random prediction function. Can we forget the parameters and put a prior directly on all the possible prediction functions? Clustering: in k-NN and GMM, we need to pre-define k or the number of mixtures. What if the amount of data is so large that more mixtures are required?

7 Introduction Gaussian process Dirichlet process

8 Back to linear regression

9 Bayesian linear regression
A Gaussian prior on W; a Gaussian posterior on W; a Gaussian predictive distribution on t.
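A minimal sketch of this chain (prior precision alpha, noise precision beta, and the polynomial features are assumptions, not from the slide): a Gaussian prior on W gives a Gaussian posterior and a Gaussian predictive distribution for t.

```python
# Minimal Bayesian linear regression sketch (1-D input, polynomial features).
# alpha (prior precision) and beta (noise precision) are assumed values.
import numpy as np

alpha, beta = 2.0, 25.0

def phi(x, degree=3):
    """Polynomial feature map: [1, x, x^2, ...]."""
    return np.vstack([x**d for d in range(degree + 1)]).T

x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + np.random.normal(0, 0.2, x_train.shape)

Phi = phi(x_train)
# Gaussian posterior on W: N(m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

# Gaussian predictive distribution on t at a new input
x_new = np.array([0.5])
phi_new = phi(x_new)
mean = phi_new @ m_N
var = 1.0 / beta + np.sum(phi_new @ S_N * phi_new, axis=1)
print(mean, var)
```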

10 A function view
Treat y = ΦW as a random function. The Gaussian on W leads to a Gaussian distribution on the training outputs Y = [y(x1), y(x2), ..., y(xN)]. We in fact place a distribution on the function y. This distribution over functions leads to a random function, and is therefore called a 'random process'. Specifying randomness directly on functions is difficult, but it can be described on any finite set of samples; here, that finite-dimensional distribution is Gaussian. A stochastic process y(x) is specified by giving the joint probability distribution for any finite set of values y(x1), ..., y(xN) in a consistent manner. The correlation between two outputs is related to how far apart the two x's are.
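A small sketch of this function view (the specific feature map and prior covariance are assumptions): if W ~ N(0, Σ) and y = ΦW, then the outputs Y are jointly Gaussian with covariance ΦΣΦᵀ, so the induced correlation between two outputs depends on how similar their feature vectors are.

```python
# Sketch: a Gaussian prior on W induces a joint Gaussian on the function values.
import numpy as np

x = np.linspace(-1, 1, 5)
Phi = np.vstack([np.ones_like(x), x, x**2]).T   # feature matrix, shape (5, 3)
Sigma_w = np.eye(3)                              # assumed prior covariance on W

# Induced covariance of Y = Phi @ W; entries depend on how similar the x's are.
K = Phi @ Sigma_w @ Phi.T
print(K)

# Sampling W and forming y = Phi @ W draws one random function at these x's.
w = np.random.multivariate_normal(np.zeros(3), Sigma_w)
y = Phi @ w
```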

11 Gaussian process A random function y(x) is said to be a Gaussian process if any finite set of its samples follows a joint Gaussian distribution with covariance given by a kernel K. Usually we take zero as the mean of the joint Gaussian, so the kernel k(x, x') determines the entire process. Note that the Gram matrix K is built from all the training data. As we will see, the prediction depends on all the training data directly, so this model (prior) is non-parametric!

12 Samples from Gaussian prior
We can sample a function by sampling its values at a limited number of locations (x1, x2, ..., xN). The sampling can be done one location at a time: when sampling at xi, marginalize out the locations not yet sampled and condition on the values that have already been sampled.
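A minimal sketch of drawing sample functions from a GP prior (the RBF kernel and its length-scale are assumptions); instead of sampling one location at a time, this draws all N values jointly from the multivariate Gaussian N(0, K):

```python
# Sketch: draw sample functions from a zero-mean GP prior with an RBF kernel.
import numpy as np

def rbf(x1, x2, lengthscale=0.3):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

x = np.linspace(0, 1, 100)
K = rbf(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability
L = np.linalg.cholesky(K)

# Each column of samples is one random function evaluated at the grid points.
samples = L @ np.random.randn(len(x), 5)
```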

13

14 Possible kernels
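Some common kernel choices, written out as a sketch (the specific hyperparameter values are placeholders, not from the slide):

```python
# Common kernels k(x, x') for scalar inputs; hyperparameters are placeholders.
import numpy as np

def rbf(x, y, l=1.0):                 # squared exponential: smooth functions
    return np.exp(-0.5 * (x - y)**2 / l**2)

def exponential(x, y, l=1.0):         # Ornstein-Uhlenbeck: rough functions
    return np.exp(-np.abs(x - y) / l)

def linear(x, y, c=1.0):              # recovers Bayesian linear regression
    return c + x * y

def periodic(x, y, l=1.0, p=1.0):     # repeating structure with period p
    return np.exp(-2 * np.sin(np.pi * np.abs(x - y) / p)**2 / l**2)
```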

15 GP for regression GP prior with any K!

16 Prediction $p(t_{N+1} \mid t_{1:N}) = \mathcal{N}\bigl(m(x_{N+1}),\, \sigma^2(x_{N+1})\bigr)$
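A minimal sketch of the predictive equations behind this formula, with $m(x_{N+1}) = k^\top C^{-1} t$ and $\sigma^2(x_{N+1}) = c - k^\top C^{-1} k$, where $C = K + \sigma_n^2 I$ (the kernel and noise level are assumed values):

```python
# Sketch: GP regression predictive mean and variance at a new input x_star.
import numpy as np

def rbf(x1, x2, l=0.3):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / l**2)

noise = 0.1
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + np.random.normal(0, noise, 10)
x_star = np.array([0.5])

C = rbf(x_train, x_train) + noise**2 * np.eye(10)        # K + sigma_n^2 I
k = rbf(x_train, x_star)                                  # shape (10, 1)
c = rbf(x_star, x_star) + noise**2                        # shape (1, 1)

mean = k.T @ np.linalg.solve(C, t_train)                  # m(x_{N+1})
var = c - k.T @ np.linalg.solve(C, k)                     # sigma^2(x_{N+1})
print(mean, var)
```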

17 Example with two samples

18 Prediction confidence

19 Gaussian process for classification
In classification, the posterior probability function y = σ(wx) is bounded to (0, 1), so it is not suitable to model as a Gaussian. We instead place the prior on the intermediate (activation) function a = wx.
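A minimal generative sketch of this idea (kernel and hyperparameters are assumptions): a GP prior is placed on the latent activation a(x), which is then squashed through the sigmoid to give class probabilities.

```python
# Sketch: GP prior on the latent activation a(x), squashed by a sigmoid.
import numpy as np

def rbf(x1, x2, l=0.3):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / l**2)

x = np.linspace(0, 1, 50)
K = rbf(x, x) + 1e-8 * np.eye(len(x))

a = np.random.multivariate_normal(np.zeros(len(x)), K)  # latent function a(x)
p = 1.0 / (1.0 + np.exp(-a))                            # class probability sigma(a)
y = (np.random.rand(len(x)) < p).astype(int)            # Bernoulli class labels
```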

20 Prediction with Gaussian approximation
Use a variational approach to solve the problem, or use the Laplace approximation.

21 Introduction Gaussian process Dirichlet process

22 Back to Gaussian mixture model
There are two sources of randomness: the indicator z, and the parameters μ and Σ. [Plate diagram: π → z → x, with μ and Σ; the plate is repeated N times]

23 What if we don't know K?
The number of clusters has to be pre-defined; this is empirical and requires prior knowledge. From the Bayesian view ∫P(x|Θ)P(Θ)dΘ, there are too many terms. How about selecting only some of them? Let the data decide how many components there are, and where they are. [Plate diagram: π → z → x, with μ and Σ; the plate is repeated N times]

24 Change a perspective for data generation
Before: sample K sets of parameters; for each data point, sample an indicator z and then use the corresponding parameters to generate the data. Now: for each data point, sample a parameter directly, equal to one of the previously drawn values with some probability and to a new value with some probability (double randomness); then sample the data point from that parameter.
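A minimal sketch of this 'double randomness' sampling scheme (often called the Pólya/Blackwell-MacQueen urn; the base distribution H and concentration α are assumptions): each new parameter equals a previously drawn value with probability proportional to how often it was drawn, and is a fresh draw from H with probability proportional to α.

```python
# Sketch: draw parameters one by one, reusing old values or drawing new ones.
import numpy as np

alpha = 1.0                                # concentration parameter (assumed)
base_H = lambda: np.random.normal(0, 5)    # base distribution H over the parameter

params = []
for n in range(100):
    if np.random.rand() < alpha / (alpha + len(params)):
        params.append(base_H())                      # new value drawn from H
    else:
        params.append(np.random.choice(params))      # reuse a previous value
print("distinct parameter values:", len(set(params)))
```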

25 Sequential random sampling
Two random spaces: the values that have already been sampled (discrete, reusable), and the underlying (continuous) base distribution from which unknown new values are drawn.

26 Chinese restaurant process
A Chinese restaurant with an infinite number of tables, each of which can seat an infinite number of customers. Each customer selects a table: either one already occupied by others, or a new one.
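A minimal simulation sketch of the CRP (α is an assumed concentration value): customer n+1 joins table k with probability n_k/(n+α) and opens a new table with probability α/(n+α); unlike the urn above, this defines only the partition (who sits together), not the parameter values.

```python
# Sketch: simulate table assignments under the Chinese restaurant process.
import numpy as np

alpha = 1.0
table_counts = []                        # number of customers at each table

for n in range(500):
    probs = np.array(table_counts + [alpha], dtype=float)
    probs /= probs.sum()
    k = np.random.choice(len(probs), p=probs)
    if k == len(table_counts):
        table_counts.append(1)           # open a new table
    else:
        table_counts[k] += 1             # join an existing table

# The number of occupied tables grows slowly (roughly like alpha * log n).
print("tables used:", len(table_counts))
```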

27 It’s not really infinite
It defines a probability on the number of tables!

28 Now think about the parameter generation
If we think about the parameter generation process instead of the data generation process: a stick-breaking construction, with double randomness (random stick weights and random parameter values).

29 Dirichlet process If we have a distribution H on the parameter θ, we can treat the distribution itself as a function. We want this function to be random, i.e., we let it be a process. The process, again, should be defined through finite samples. The Dirichlet process defines the probability of any partition of the θ space to be a Dirichlet distribution, as follows, where G is a sampled distribution (like H). Many works demonstrate that such a process exists, but the most interesting construction is stick-breaking.
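The defining property, written out (using a concentration parameter α and base distribution H, which the slide leaves implicit): for every finite measurable partition A_1, ..., A_K of the θ space,

```latex
% G ~ DP(alpha, H) means: for any finite partition A_1, ..., A_K of the theta space,
\bigl(G(A_1), G(A_2), \ldots, G(A_K)\bigr) \sim
  \mathrm{Dir}\bigl(\alpha H(A_1), \alpha H(A_2), \ldots, \alpha H(A_K)\bigr)
```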

30 Another definition

31 The equivalence of the two definitions
"A Constructive Definition of Dirichlet Priors", Jayaram Sethuraman. The stick-breaking constructed distribution:
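A minimal sketch of the stick-breaking construction (truncated at a finite number of sticks for simulation; α and H are assumptions): break a unit stick into pieces β_k ~ Beta(1, α), give each piece weight π_k = β_k ∏_{j<k}(1−β_j), and attach it to an atom θ_k ~ H, so that G = Σ_k π_k δ_{θ_k}.

```python
# Sketch: truncated stick-breaking construction of a draw G ~ DP(alpha, H).
import numpy as np

alpha, T = 1.0, 200                          # concentration and truncation level
base_H = lambda size: np.random.normal(0, 5, size)

beta = np.random.beta(1, alpha, T)           # stick-breaking proportions
remaining = np.concatenate([[1.0], np.cumprod(1 - beta)[:-1]])
weights = beta * remaining                   # pi_k = beta_k * prod_{j<k} (1 - beta_j)
atoms = base_H(T)                            # theta_k ~ H

# G is (approximately) the discrete distribution sum_k weights[k] * delta(atoms[k]).
print("weight captured by truncation:", weights.sum())
```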

32 Contribution of the concentration parameter

33 Posterior probability
Given some samples of θ, what should G look like? Another DP! Given those samples, the prediction for the next sample from G is:
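Written out (α and H are the concentration parameter and base distribution, left implicit on the slide): the posterior is again a DP, and the predictive for the next sample mixes the base measure with the observed values.

```latex
G \mid \theta_1,\ldots,\theta_N \sim
  \mathrm{DP}\!\Bigl(\alpha + N,\;
  \frac{\alpha H + \sum_{i=1}^{N} \delta_{\theta_i}}{\alpha + N}\Bigr),
\qquad
\theta_{N+1} \mid \theta_{1:N} \sim
  \frac{\alpha}{\alpha + N}\, H
  + \sum_{i=1}^{N} \frac{1}{\alpha + N}\, \delta_{\theta_i}
```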

34 DP Mixture Sample a ‘discrete’ distribution G over θ
Sample a parameter Φ for each data point from G, and sample the data point using the parametric model defined by Φ. By observing some Φ, we learn something about G; by observing some X, we obtain a posterior probability over G.
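A minimal generative sketch of a DP mixture (Gaussian components with an assumed fixed variance; G is approximated by truncated stick-breaking): draw G, draw a parameter Φ_i from G for each data point, then draw x_i from the component defined by Φ_i.

```python
# Sketch: generate data from a DP mixture of Gaussians via truncated stick-breaking.
import numpy as np

alpha, T, N, obs_std = 1.0, 100, 300, 0.5

beta = np.random.beta(1, alpha, T)
weights = beta * np.concatenate([[1.0], np.cumprod(1 - beta)[:-1]])
weights /= weights.sum()                       # renormalize the truncated weights
atoms = np.random.normal(0, 5, T)              # component means theta_k ~ H

phi = np.random.choice(atoms, size=N, p=weights)   # Phi_i ~ G for each data point
x = np.random.normal(phi, obs_std)                 # x_i ~ N(Phi_i, obs_std^2)
print("distinct components actually used:", len(set(phi)))
```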

35 Gibbs sampling Sample the cluster assignments and parameters to infer the partition and the component parameters.
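A minimal collapsed Gibbs sampling sketch for a 1-D DP mixture of Gaussians (all specifics are assumptions: known observation variance, Gaussian base measure on the means, CRP prior over assignments): each sweep resamples every cluster assignment given all the others.

```python
# Sketch: collapsed Gibbs sampling for a 1-D DP Gaussian mixture (assumed priors).
import numpy as np

ALPHA, SIGMA2, MU0, TAU2 = 1.0, 0.25, 0.0, 9.0   # concentration, obs var, base measure

def predictive(xi, members):
    """p(xi | cluster members), with the cluster mean integrated out."""
    n = len(members)
    v = 1.0 / (1.0 / TAU2 + n / SIGMA2)                  # posterior var of the mean
    m = v * (MU0 / TAU2 + np.sum(members) / SIGMA2)      # posterior mean of the mean
    var = v + SIGMA2
    return np.exp(-0.5 * (xi - m) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gibbs_sweep(x, z):
    for i in range(len(x)):
        z[i] = -1                                        # remove x[i] from its cluster
        labels = sorted(set(z) - {-1})
        weights = [np.sum(z == k) * predictive(x[i], x[z == k]) for k in labels]
        weights.append(ALPHA * predictive(x[i], np.array([])))   # brand-new cluster
        weights = np.array(weights) / np.sum(weights)
        choice = np.random.choice(len(weights), p=weights)
        z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z

# Usage: two well-separated clusters; start with everything in one cluster.
x = np.concatenate([np.random.normal(-3, 0.5, 50), np.random.normal(3, 0.5, 50)])
z = np.zeros(len(x), dtype=int)
for _ in range(30):
    z = gibbs_sweep(x, z)
print("clusters found:", len(set(z)))
```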

36 Some applications DP topic models, no limit on the number of clusters
Infinite-state HMM; speaker diarization and scene analysis, with no limit on the number of speakers; theme detection; image segmentation; ...

37 Wrap up Some models, or parts of models, cannot be described by a fixed set of parameters; this is where non-parametric methods are useful. A Gaussian process is a random function: the randomness on functions is defined as randomness on finite sets of variables, which follow a multivariate Gaussian. A Dirichlet process is another random function, namely a random distribution: its randomness is defined on any partition. Stick-breaking and the CRP are both good metaphors for the process. Non-parametric Bayes is a nice extension to the Bayesian framework, connecting random variables to random functions.

