CSLT ML Summer Seminar (9): Non-Parametric Models. Dong Wang
Content: Introduction, Gaussian process, Dirichlet process
Parametric models Usually we want a model M(X) that is represented by a set of parameters, in any form (discriminative, descriptive, ...). The model parameters are optimized with any reasonable criterion, by any reasonable optimization approach. The parameterized model is then used for any inference task: prediction, classification, generation, or others. It is simple, isn't it?
But things are not so straightforward Parameterization implies 'knowledge' about the model. Sometimes we don't have such knowledge, and then it is better to work in a non-parametric way: density estimation, nearest neighbour, SVM.
Non-parametric models We say that a learning algorithm is non-parametric if the complexity of the functions it can learn is allowed to grow as the amount of training data is increased. (Yoshua Bengio, 18 Aug 2008)
Two examples Bayesian linear regression: we impose a Gaussian prior, which leads to a random prediction function. Can we forget the parameters and place a prior on all possible prediction functions? Clustering: in methods such as k-means and GMM, we need to predefine the number of clusters or mixture components. What if there is a large amount of data, so that more components are required?
Introduction Gaussian process Dirichlet process
Back to linear regression
Bayesian linear regression A Gaussian prior on W, a Gaussian posterior on W, and a Gaussian predictive distribution on t.
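As a concrete illustration, here is a minimal numpy sketch of that prior/posterior/predictive chain. The formulas follow the standard treatment (e.g. Bishop, PRML ch. 3); the hyperparameter values alpha and beta are illustrative assumptions.

```python
import numpy as np

# Bayesian linear regression sketch: prior w ~ N(0, alpha^-1 I),
# Gaussian noise with precision beta on the targets t.
def bayes_linreg_posterior(Phi, t, alpha=1.0, beta=25.0):
    """Return posterior mean m_N and covariance S_N of the weights."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_new, m_N, S_N, beta=25.0):
    """Gaussian predictive distribution (mean, variance) for a new feature vector."""
    mean = phi_new @ m_N
    var = 1.0 / beta + phi_new @ S_N @ phi_new
    return mean, var
```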
A function view Treat y = ΦW as a random function. The Gaussian prior on W leads to a Gaussian distribution on the training outputs Y = [y(x1), y(x2), ..., y(xN)]. We in fact place a distribution on the function y itself; this distribution over functions yields a random function, and is therefore called a 'random process'. Specifying the randomness over functions directly is difficult, so it is described on finite sets of samples; here, the distribution is Gaussian. A stochastic process y(x) is specified by giving the joint probability distribution for any finite set of values y(x1), ..., y(xN) in a consistent manner. The correlation between values is related to how far apart the two x's are.
Gaussian process For a random function y(x), if any finite set of its samples follows a joint Gaussian distribution with kernel K, then y(x) is said to be a Gaussian process. Usually we use zero as the mean of the joint Gaussian, so the kernel k(x, x') determines the entire behaviour. Note that the Gram matrix K is built from all the training data. As we will see, the prediction depends on all the training data directly, so this model (prior) is non-parametric!
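In symbols, the definition above reads (this is the standard zero-mean formulation):

```latex
y(\mathbf{x}) \sim \mathcal{GP}\bigl(0,\, k(\mathbf{x},\mathbf{x}')\bigr)
\quad\Longleftrightarrow\quad
\bigl(y(\mathbf{x}_1),\dots,y(\mathbf{x}_N)\bigr)^{\top} \sim \mathcal{N}(\mathbf{0},\, K),
\qquad K_{ij} = k(\mathbf{x}_i,\mathbf{x}_j).
```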
Samples from the Gaussian prior We can sample a function by sampling its values at a finite number of locations (x1, x2, ..., xN). The sampling can be done one by one: when sampling y(xi), we marginalize out the locations not yet sampled and condition on the values already drawn.
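The one-by-one sampling uses the standard Gaussian conditioning formula; writing K_{i-1} for the Gram matrix of the locations already sampled and k_i for their covariances with x_i:

```latex
p\bigl(y(x_i)\mid y(x_1),\dots,y(x_{i-1})\bigr)
= \mathcal{N}\bigl(\mathbf{k}_i^{\top}K_{i-1}^{-1}\,\mathbf{y}_{1:i-1},\;
  k(x_i,x_i)-\mathbf{k}_i^{\top}K_{i-1}^{-1}\mathbf{k}_i\bigr),
\qquad \mathbf{k}_i=\bigl(k(x_i,x_1),\dots,k(x_i,x_{i-1})\bigr)^{\top}.
```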
Possible kernels https://en.wikipedia.org/wiki/Gaussian_process
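A minimal sketch of drawing random functions from a zero-mean GP prior. The RBF (squared-exponential) kernel and its hyperparameters are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

# Draw sample functions from a zero-mean GP prior with an RBF kernel.
def rbf_kernel(X1, X2, length_scale=1.0, variance=1.0):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / length_scale ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 100)                      # finite set of input locations
K = rbf_kernel(xs, xs) + 1e-8 * np.eye(len(xs))   # jitter for numerical stability
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)  # 3 random functions
```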
GP for regression A GP prior with any kernel K!
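In the standard regression setup (as in Bishop, PRML ch. 6), the observed targets are the latent function plus Gaussian noise with precision β, so their marginal is again Gaussian:

```latex
t_n = y(\mathbf{x}_n) + \epsilon_n,\quad \epsilon_n \sim \mathcal{N}(0,\beta^{-1}),
\qquad
p(\mathbf{t}) = \mathcal{N}(\mathbf{t}\mid\mathbf{0},\, C),\quad C = K + \beta^{-1} I_N.
```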
Prediction $p(t_{N+1} \mid t_{1:N}) = \mathcal{N}\bigl(m(x_{N+1}),\, \sigma^2(x_{N+1})\bigr)$
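A minimal sketch of computing that predictive mean and variance. The kernel argument is any covariance function (for example the rbf_kernel sketched earlier), and noise_var plays the role of the noise variance β^{-1}; both are assumptions for illustration.

```python
import numpy as np

# GP regression predictive distribution at new inputs.
def gp_predict(X_train, t_train, X_test, kernel, noise_var=0.1):
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))  # C = K + beta^-1 I
    K_s = kernel(X_train, X_test)                                    # cross-covariances
    K_ss = kernel(X_test, X_test) + noise_var * np.eye(len(X_test))  # includes target noise
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ t_train            # m(x_{N+1}) = k^T C^{-1} t
    cov = K_ss - K_s.T @ K_inv @ K_s          # sigma^2    = c - k^T C^{-1} k
    return mean, np.diag(cov)
```

In practice one would use a Cholesky factorization instead of a direct inverse, but the direct form keeps the correspondence with the formula visible.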
Example with two samples
Prediction confidence
Gaussian process for classification In classification, the posterior probability function y = σ(w·x) is bounded, so it is not suitable to model directly as a Gaussian. We instead place the prior on the latent (pre-activation) function a = w·x.
Prediction with Gaussian approximation Use a variational approach to solve the problem, or use the Laplace approximation.
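A hedged sketch of the Laplace approximation route. The Newton update is the standard, non-stabilized form (cf. Rasmussen & Williams, GPML, Algorithm 3.1); labels t are in {0, 1}, and the final class probability uses a simple MAP plug-in rather than the full predictive integral.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def laplace_mode(K, t, n_iter=20):
    """Find the posterior mode of the latent values a (length N) by Newton iteration."""
    a = np.zeros(len(t))
    K_inv = np.linalg.inv(K + 1e-8 * np.eye(len(t)))
    for _ in range(n_iter):
        pi = sigmoid(a)
        W = np.diag(pi * (1 - pi))                    # negative Hessian of log-likelihood
        grad = t - pi                                 # gradient of log-likelihood
        a = np.linalg.solve(K_inv + W, W @ a + grad)  # Newton step
    return a

def predict_prob(k_star, t, a_hat):
    """Approximate class probability at a test point (MAP-mean plug-in)."""
    mean = k_star @ (t - sigmoid(a_hat))              # latent predictive mean
    return sigmoid(mean)
```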
Introduction Gaussian process Dirichlet process
Back to the Gaussian mixture model There are two sources of randomness: the indicator z, and the parameters μ and Σ. [Plate diagram: π → z_n → x_n ← (μ, Σ), repeated over the N data points]
What if we don't know K? The number of clusters must be pre-defined; this is empirical and requires prior knowledge. From the Bayesian view, ∫P(x|Θ)P(Θ)dΘ contains too many components. How about selecting some of them? Let the data decide how many components there are, and where they are. [Same plate diagram as above]
Change of perspective for data generation Before: sample K sets of parameters; for each data point, sample an indicator z, and then use the corresponding parameters to generate the data. Now: for each data point, sample a parameter, which with some probability equals one of the previous values and with some probability is a new value (double randomness); then sample the data point from that parameter.
Sequential random sampling Two random spaces: the discrete set of values that have already been sampled, and the underlying (continuous) base distribution from which unseen new values are drawn.
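This sequential scheme is the standard Blackwell-MacQueen (Pólya urn) predictive: with base distribution H and concentration parameter α,

```latex
\theta_{n+1}\mid\theta_1,\dots,\theta_n \;\sim\;
\frac{1}{\alpha+n}\sum_{i=1}^{n}\delta_{\theta_i} \;+\; \frac{\alpha}{\alpha+n}\,H .
```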
Chinese restaurant process A Chinese restaurant has an infinite number of tables, each of which can seat an infinite number of customers. Each customer selects a seat: either at a table already occupied by others, or at a new one.
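A minimal sampler for the table assignments; alpha is the concentration parameter.

```python
import numpy as np

# Chinese restaurant process: customer n+1 joins an occupied table with
# probability proportional to its size, or opens a new table with
# probability proportional to alpha.
def sample_crp(n_customers, alpha=1.0, rng=None):
    rng = rng or np.random.default_rng()
    tables = []                        # number of customers at each table
    assignments = []
    for n in range(n_customers):
        probs = np.array(tables + [alpha], dtype=float)
        probs /= probs.sum()           # normalizer is n + alpha
        k = rng.choice(len(probs), p=probs)
        if k == len(tables):
            tables.append(1)           # open a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables
```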
It's not really infinite It defines a probability distribution over the number of occupied tables!
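Concretely, the expected number of occupied tables after N customers is a standard result:

```latex
\mathbb{E}[K_N] \;=\; \sum_{n=1}^{N}\frac{\alpha}{\alpha+n-1}
\;\approx\; \alpha\,\log\!\Bigl(1+\frac{N}{\alpha}\Bigr).
```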
Now think about the parameter generation If we consider the parameter generation process instead of the data generation process: a stick-breaking construction, with double randomness.
Dirichlet process If we have a distribution H on the parameter θ, we treat it as a function. We want this function to be random, i.e., we let it be a process. The process, again, should be defined through finite samples. The Dirichlet process defines the probability on any finite partition (A_1, ..., A_K) of the parameter space to be a Dirichlet distribution, where G is a sampled distribution (a random distribution, like H): (G(A_1), ..., G(A_K)) ~ Dir(αH(A_1), ..., αH(A_K)). Many works demonstrate that such a process exists; the most interesting construction is stick-breaking.
Another definition
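The alternative definition referenced here is presumably the stick-breaking construction cited on the next slide; a standard statement:

```latex
\beta_k \sim \mathrm{Beta}(1,\alpha),\qquad
\pi_k = \beta_k\prod_{l<k}(1-\beta_l),\qquad
\theta_k^{*}\sim H,\qquad
G = \sum_{k=1}^{\infty}\pi_k\,\delta_{\theta_k^{*}} .
```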
The equivalence of the two definitions "A Constructive Definition of Dirichlet Priors", Jayaram Sethuraman. http://www3.stat.sinica.edu.tw/statistica/oldpdf/A4n216.pdf (the stick-breaking constructed distribution)
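A minimal truncated sketch of the construction; the truncation level and the choice of base distribution H (a standard normal here) are illustrative assumptions.

```python
import numpy as np

# Truncated stick-breaking construction of G ~ DP(alpha, H):
# weights pi_k come from Beta(1, alpha) sticks, atoms theta_k are drawn from H.
def stick_breaking(alpha=1.0, truncation=100, rng=None):
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * remaining                     # pi_k = beta_k * prod_{l<k}(1 - beta_l)
    atoms = rng.normal(0.0, 1.0, size=truncation)   # theta_k ~ H (assumed N(0, 1))
    return weights, atoms
```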
Contribution of the concentration parameter
Posterior probability Given some samples of θ, what should G look like? Another DP! And given those samples, the prediction for the next draw from G follows.
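The standard form of that posterior (with the predictive for the next draw given by the urn formula shown earlier):

```latex
G \mid \theta_1,\dots,\theta_n \;\sim\;
\mathrm{DP}\!\Bigl(\alpha+n,\;\frac{\alpha H+\sum_{i=1}^{n}\delta_{\theta_i}}{\alpha+n}\Bigr).
```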
DP Mixture Sample a 'discrete' distribution G over θ. For each data point, sample Φ from G, and sample the data point from the parametric model defined by Φ. By observing some Φ's, we learn something about G; by observing some X, we obtain the posterior probability over G.
Gibbs sampling Sample the cluster assignments and parameters to infer the partition and the component parameters.
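A hedged sketch of one collapsed Gibbs sweep for a DP mixture of 1-D Gaussians with known noise variance and a conjugate normal prior on the cluster means (a simplified version of Neal 2000, Algorithm 3; all hyperparameter values are illustrative assumptions).

```python
import numpy as np

def cluster_predictive(x, members, mu0=0.0, tau0_2=4.0, sigma2=1.0):
    """p(x | points already in the cluster), with the cluster mean integrated out."""
    n, s = len(members), sum(members)
    post_prec = 1.0 / tau0_2 + n / sigma2
    post_mean = (mu0 / tau0_2 + s / sigma2) / post_prec
    var = 1.0 / post_prec + sigma2
    return np.exp(-0.5 * (x - post_mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gibbs_sweep(X, z, alpha=1.0, rng=None):
    """One sweep: resample every assignment z[i] given all the others."""
    rng = rng or np.random.default_rng()
    for i in range(len(X)):
        z[i] = -1                                   # remove x_i from its cluster
        labels = sorted(set(z) - {-1})
        weights, options = [], []
        for c in labels:                            # existing clusters
            members = [X[j] for j in range(len(X)) if z[j] == c]
            weights.append(len(members) * cluster_predictive(X[i], members))
            options.append(c)
        weights.append(alpha * cluster_predictive(X[i], []))  # brand-new cluster
        options.append(max(labels, default=-1) + 1)
        weights = np.array(weights) / np.sum(weights)
        z[i] = options[rng.choice(len(options), p=weights)]
    return z
```

Usage: start from a trivial assignment such as z = [0] * len(X) and call gibbs_sweep repeatedly; the number of occupied clusters then grows or shrinks as the data demand.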
Some applications DP topic models, with no limit on the number of topics; the infinite-state HMM; speaker diarization and scene analysis, with no limit on the number of speakers; theme detection; image segmentation; ...
Wrap up We have seen that some models, or parts of models, cannot be captured by a fixed set of parameters; there, non-parametric methods are useful. A Gaussian process is a random function: the randomness over functions is defined as randomness over finite sets of variables, which follow a multivariate Gaussian. A Dirichlet process is another random function, namely a random distribution; its randomness is defined on arbitrary partitions. Stick-breaking and the CRP are both good metaphors for the process. Non-parametric Bayesian methods are a nice extension of the Bayesian framework, connecting random variables to random functions.