Non-Parametric Models


1 Non-Parametric Models
CSLT ML Summer Seminar (9) Non-Parametric Models Dong Wang

2 Content Introduction Gaussian process Dirichlet process

3 Parametric models Usually we hope for a model M(X) represented by a set of parameters, in any form (discriminative, descriptive, ...). The parameters are optimized with any reasonable criterion, by any reasonable optimization approach. The parameterized model is then used for any inference task, including prediction, classification, generation, or others. It is simple, isn't it?

4 But things are not so straightforward
Parameterization means 'knowledge' about the model. Sometimes we don't have such knowledge, and then it is better to work in a non-parametric way: density estimation, nearest neighbour, SVM.

5 Non-parametric models
We say that a learning algorithm is non-parametric if the complexity of the functions it can learn is allowed to grow as the amount of training data is increased. — Yoshua Bengio, 18 Aug 2008

6 Two examples Bayesian linear regression Clustering
Bayesian linear regression: we impose a Gaussian prior, which leads to a random prediction function. Can we forget the parameters and put a prior directly on all the possible prediction functions? Clustering: in k-NN and GMM, we need to pre-define k or the number of mixtures. What if the amount of data is so large that more mixtures are required?

7 Introduction Gaussian process Dirichlet process

8 Back to linear regression

9 Bayesian linear regression
A Gaussian prior on W; a Gaussian posterior on W; a Gaussian predictive distribution on t.
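A minimal sketch of this chain (prior precision alpha, noise precision beta, and the polynomial features are assumptions, not from the slide): a Gaussian prior on W gives a Gaussian posterior and a Gaussian predictive distribution for t.

```python
# Minimal Bayesian linear regression sketch (1-D input, polynomial features).
# alpha (prior precision) and beta (noise precision) are assumed values.
import numpy as np

alpha, beta = 2.0, 25.0

def phi(x, degree=3):
    """Polynomial feature map: [1, x, x^2, ...]."""
    return np.vstack([x**d for d in range(degree + 1)]).T

x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + np.random.normal(0, 0.2, x_train.shape)

Phi = phi(x_train)
# Gaussian posterior on W: N(m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

# Gaussian predictive distribution on t at a new input
x_new = np.array([0.5])
phi_new = phi(x_new)
mean = phi_new @ m_N
var = 1.0 / beta + np.sum(phi_new @ S_N * phi_new, axis=1)
print(mean, var)
```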

10 A function view
Treat y = ΦW as a random function. The Gaussian on W leads to a Gaussian distribution on the training outputs Y = [y(x1), y(x2), ..., y(xN)]. We in fact place a distribution on the function y. This distribution over functions leads to a random function, and is therefore called a 'random process'. Specifying randomness directly on functions is difficult, but it can be described on any finite set of samples; here, that finite-dimensional distribution is Gaussian. A stochastic process y(x) is specified by giving the joint probability distribution for any finite set of values y(x1), ..., y(xN) in a consistent manner. The correlation between two outputs is related to how far apart the two x's are.
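A small sketch of this function view (the specific feature map and prior covariance are assumptions): if W ~ N(0, Σ) and y = ΦW, then the outputs Y are jointly Gaussian with covariance ΦΣΦᵀ, so the induced correlation between two outputs depends on how similar their feature vectors are.

```python
# Sketch: a Gaussian prior on W induces a joint Gaussian on the function values.
import numpy as np

x = np.linspace(-1, 1, 5)
Phi = np.vstack([np.ones_like(x), x, x**2]).T   # feature matrix, shape (5, 3)
Sigma_w = np.eye(3)                              # assumed prior covariance on W

# Induced covariance of Y = Phi @ W; entries depend on how similar the x's are.
K = Phi @ Sigma_w @ Phi.T
print(K)

# Sampling W and forming y = Phi @ W draws one random function at these x's.
w = np.random.multivariate_normal(np.zeros(3), Sigma_w)
y = Phi @ w
```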

11 Gaussian process A random function y(x) is said to be a Gaussian process if any finite set of its samples follows a joint Gaussian distribution with covariance given by a kernel K. Usually we take zero as the mean of the joint Gaussian, so the kernel k(x, x') determines the entire process. Note that the Gram matrix K is built from all the training data. As we will see, the prediction depends on all the training data directly, so this model (prior) is non-parametric!

12 Samples from Gaussian prior
We can sample a function by sampling its values at a limited number of locations (x1, x2, ..., xN). The sampling can be done one location at a time: when sampling at xi, marginalize out the locations not yet sampled and condition on the values that have already been sampled.
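A minimal sketch of drawing sample functions from a GP prior (the RBF kernel and its length-scale are assumptions); instead of sampling one location at a time, this draws all N values jointly from the multivariate Gaussian N(0, K):

```python
# Sketch: draw sample functions from a zero-mean GP prior with an RBF kernel.
import numpy as np

def rbf(x1, x2, lengthscale=0.3):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

x = np.linspace(0, 1, 100)
K = rbf(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability
L = np.linalg.cholesky(K)

# Each column of samples is one random function evaluated at the grid points.
samples = L @ np.random.randn(len(x), 5)
```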

13

14 Possible kernels
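Some common kernel choices, written out as a sketch (the specific hyperparameter values are placeholders, not from the slide):

```python
# Common kernels k(x, x') for scalar inputs; hyperparameters are placeholders.
import numpy as np

def rbf(x, y, l=1.0):                 # squared exponential: smooth functions
    return np.exp(-0.5 * (x - y)**2 / l**2)

def exponential(x, y, l=1.0):         # Ornstein-Uhlenbeck: rough functions
    return np.exp(-np.abs(x - y) / l)

def linear(x, y, c=1.0):              # recovers Bayesian linear regression
    return c + x * y

def periodic(x, y, l=1.0, p=1.0):     # repeating structure with period p
    return np.exp(-2 * np.sin(np.pi * np.abs(x - y) / p)**2 / l**2)
```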

15 GP for regression GP prior with any K!

16 Prediction $p(t_{N+1} \mid t_{1:N}) = \mathcal{N}\bigl(m(x_{N+1}),\, \sigma^2(x_{N+1})\bigr)$
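A minimal sketch of the predictive equations behind this formula, with $m(x_{N+1}) = k^\top C^{-1} t$ and $\sigma^2(x_{N+1}) = c - k^\top C^{-1} k$, where $C = K + \sigma_n^2 I$ (the kernel and noise level are assumed values):

```python
# Sketch: GP regression predictive mean and variance at a new input x_star.
import numpy as np

def rbf(x1, x2, l=0.3):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / l**2)

noise = 0.1
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + np.random.normal(0, noise, 10)
x_star = np.array([0.5])

C = rbf(x_train, x_train) + noise**2 * np.eye(10)        # K + sigma_n^2 I
k = rbf(x_train, x_star)                                  # shape (10, 1)
c = rbf(x_star, x_star) + noise**2                        # shape (1, 1)

mean = k.T @ np.linalg.solve(C, t_train)                  # m(x_{N+1})
var = c - k.T @ np.linalg.solve(C, k)                     # sigma^2(x_{N+1})
print(mean, var)
```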

17 Example with two samples

18 Prediction confidence

19 Gaussian process for classification
In classification, the posterior probability function y = σ(wx) is bounded to (0, 1), so it is not suitable to model as a Gaussian. We instead place the prior on the intermediate (activation) function a = wx.
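A minimal generative sketch of this idea (kernel and hyperparameters are assumptions): a GP prior is placed on the latent activation a(x), which is then squashed through the sigmoid to give class probabilities.

```python
# Sketch: GP prior on the latent activation a(x), squashed by a sigmoid.
import numpy as np

def rbf(x1, x2, l=0.3):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / l**2)

x = np.linspace(0, 1, 50)
K = rbf(x, x) + 1e-8 * np.eye(len(x))

a = np.random.multivariate_normal(np.zeros(len(x)), K)  # latent function a(x)
p = 1.0 / (1.0 + np.exp(-a))                            # class probability sigma(a)
y = (np.random.rand(len(x)) < p).astype(int)            # Bernoulli class labels
```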

20 Prediction with Gaussian approximation
Use a variational approach to solve the problem, or use the Laplace approximation.

21 Introduction Gaussian process Dirichlet process

22 Back to Gaussian mixture model
There are two sources of randomness: the indicator z, and the parameters μ and Σ. [Plate diagram: π → z → x, with μ and Σ; the plate is repeated N times]

23 What if we don't know K?
The number of clusters has to be pre-defined; this is empirical and requires prior knowledge. From the Bayesian view ∫P(x|Θ)P(Θ)dΘ, there are too many terms. How about selecting only some of them? Let the data decide how many components there are, and where they are. [Plate diagram: π → z → x, with μ and Σ; the plate is repeated N times]

24 Change a perspective for data generation
Before: sample K sets of parameters; for each data point, sample an indicator z and then use the corresponding parameters to generate the data. Now: for each data point, sample a parameter directly, equal to one of the previously drawn values with some probability and to a new value with some probability (double randomness); then sample the data point from that parameter.
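A minimal sketch of this 'double randomness' sampling scheme (often called the Pólya/Blackwell-MacQueen urn; the base distribution H and concentration α are assumptions): each new parameter equals a previously drawn value with probability proportional to how often it was drawn, and is a fresh draw from H with probability proportional to α.

```python
# Sketch: draw parameters one by one, reusing old values or drawing new ones.
import numpy as np

alpha = 1.0                                # concentration parameter (assumed)
base_H = lambda: np.random.normal(0, 5)    # base distribution H over the parameter

params = []
for n in range(100):
    if np.random.rand() < alpha / (alpha + len(params)):
        params.append(base_H())                      # new value drawn from H
    else:
        params.append(np.random.choice(params))      # reuse a previous value
print("distinct parameter values:", len(set(params)))
```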

25 Sequential random sampling
Two random spaces: the values that have already been sampled (discrete, reusable), and the underlying (continuous) base distribution from which unknown new values are drawn.

26 Chinese restaurant process
A Chinese restaurant with an infinite number of tables, each of which can seat an infinite number of customers. Each customer selects a table: either one already occupied by others, or a new one.
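A minimal simulation sketch of the CRP (α is an assumed concentration value): customer n+1 joins table k with probability n_k/(n+α) and opens a new table with probability α/(n+α); unlike the urn above, this defines only the partition (who sits together), not the parameter values.

```python
# Sketch: simulate table assignments under the Chinese restaurant process.
import numpy as np

alpha = 1.0
table_counts = []                        # number of customers at each table

for n in range(500):
    probs = np.array(table_counts + [alpha], dtype=float)
    probs /= probs.sum()
    k = np.random.choice(len(probs), p=probs)
    if k == len(table_counts):
        table_counts.append(1)           # open a new table
    else:
        table_counts[k] += 1             # join an existing table

# The number of occupied tables grows slowly (roughly like alpha * log n).
print("tables used:", len(table_counts))
```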

27 It’s not really infinite
It defines a probability on the number of tables!

28 Now think about the parameter generation
If we think about the parameter generation process instead of the data generation process: a stick-breaking construction, with double randomness (random stick weights and random parameter values).

29 Dirichlet process If we have a distribution H on the parameter θ, we can treat the distribution itself as a function. We want this function to be random, i.e., we let it be a process. The process, again, should be defined through finite samples. The Dirichlet process defines the probability of any partition of the θ space to be a Dirichlet distribution, as follows, where G is a sampled distribution (like H). Many works demonstrate that such a process exists, but the most interesting construction is stick-breaking.
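The defining property, written out (using a concentration parameter α and base distribution H, which the slide leaves implicit): for every finite measurable partition A_1, ..., A_K of the θ space,

```latex
% G ~ DP(alpha, H) means: for any finite partition A_1, ..., A_K of the theta space,
\bigl(G(A_1), G(A_2), \ldots, G(A_K)\bigr) \sim
  \mathrm{Dir}\bigl(\alpha H(A_1), \alpha H(A_2), \ldots, \alpha H(A_K)\bigr)
```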

30 Another definition

31 The equivalence of the two definitions
"A Constructive Definition of Dirichlet Priors", Jayaram Sethuraman. The stick-breaking constructed distribution:
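A minimal sketch of the stick-breaking construction (truncated at a finite number of sticks for simulation; α and H are assumptions): break a unit stick into pieces β_k ~ Beta(1, α), give each piece weight π_k = β_k ∏_{j<k}(1−β_j), and attach it to an atom θ_k ~ H, so that G = Σ_k π_k δ_{θ_k}.

```python
# Sketch: truncated stick-breaking construction of a draw G ~ DP(alpha, H).
import numpy as np

alpha, T = 1.0, 200                          # concentration and truncation level
base_H = lambda size: np.random.normal(0, 5, size)

beta = np.random.beta(1, alpha, T)           # stick-breaking proportions
remaining = np.concatenate([[1.0], np.cumprod(1 - beta)[:-1]])
weights = beta * remaining                   # pi_k = beta_k * prod_{j<k} (1 - beta_j)
atoms = base_H(T)                            # theta_k ~ H

# G is (approximately) the discrete distribution sum_k weights[k] * delta(atoms[k]).
print("weight captured by truncation:", weights.sum())
```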

32 Contribution of the concentration parameter

33 Posterior probability
Given some samples of θ, what should G look like? Another DP! Given those samples, the prediction for the next sample from G is:
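Written out (α and H are the concentration parameter and base distribution, left implicit on the slide): the posterior is again a DP, and the predictive for the next sample mixes the base measure with the observed values.

```latex
G \mid \theta_1,\ldots,\theta_N \sim
  \mathrm{DP}\!\Bigl(\alpha + N,\;
  \frac{\alpha H + \sum_{i=1}^{N} \delta_{\theta_i}}{\alpha + N}\Bigr),
\qquad
\theta_{N+1} \mid \theta_{1:N} \sim
  \frac{\alpha}{\alpha + N}\, H
  + \sum_{i=1}^{N} \frac{1}{\alpha + N}\, \delta_{\theta_i}
```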

34 DP Mixture Sample a ‘discrete’ distribution G over θ
Sample a parameter Φ for each data point from G, and sample the data point using the parametric model defined by Φ. By observing some Φ, we learn something about G; by observing some X, we obtain a posterior probability over G.
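A minimal generative sketch of a DP mixture (Gaussian components with an assumed fixed variance; G is approximated by truncated stick-breaking): draw G, draw a parameter Φ_i from G for each data point, then draw x_i from the component defined by Φ_i.

```python
# Sketch: generate data from a DP mixture of Gaussians via truncated stick-breaking.
import numpy as np

alpha, T, N, obs_std = 1.0, 100, 300, 0.5

beta = np.random.beta(1, alpha, T)
weights = beta * np.concatenate([[1.0], np.cumprod(1 - beta)[:-1]])
weights /= weights.sum()                       # renormalize the truncated weights
atoms = np.random.normal(0, 5, T)              # component means theta_k ~ H

phi = np.random.choice(atoms, size=N, p=weights)   # Phi_i ~ G for each data point
x = np.random.normal(phi, obs_std)                 # x_i ~ N(Phi_i, obs_std^2)
print("distinct components actually used:", len(set(phi)))
```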

35 Gibbs sampling Sample the cluster assignments and parameters to infer the partition and the component parameters.
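A minimal collapsed Gibbs sampling sketch for a 1-D DP mixture of Gaussians (all specifics are assumptions: known observation variance, Gaussian base measure on the means, CRP prior over assignments): each sweep resamples every cluster assignment given all the others.

```python
# Sketch: collapsed Gibbs sampling for a 1-D DP Gaussian mixture (assumed priors).
import numpy as np

ALPHA, SIGMA2, MU0, TAU2 = 1.0, 0.25, 0.0, 9.0   # concentration, obs var, base measure

def predictive(xi, members):
    """p(xi | cluster members), with the cluster mean integrated out."""
    n = len(members)
    v = 1.0 / (1.0 / TAU2 + n / SIGMA2)                  # posterior var of the mean
    m = v * (MU0 / TAU2 + np.sum(members) / SIGMA2)      # posterior mean of the mean
    var = v + SIGMA2
    return np.exp(-0.5 * (xi - m) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gibbs_sweep(x, z):
    for i in range(len(x)):
        z[i] = -1                                        # remove x[i] from its cluster
        labels = sorted(set(z) - {-1})
        weights = [np.sum(z == k) * predictive(x[i], x[z == k]) for k in labels]
        weights.append(ALPHA * predictive(x[i], np.array([])))   # brand-new cluster
        weights = np.array(weights) / np.sum(weights)
        choice = np.random.choice(len(weights), p=weights)
        z[i] = labels[choice] if choice < len(labels) else max(labels, default=-1) + 1
    return z

# Usage: two well-separated clusters; start with everything in one cluster.
x = np.concatenate([np.random.normal(-3, 0.5, 50), np.random.normal(3, 0.5, 50)])
z = np.zeros(len(x), dtype=int)
for _ in range(30):
    z = gibbs_sweep(x, z)
print("clusters found:", len(set(z)))
```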

36 Some applications DP topic models, no limit on the number of clusters
Infinite-state HMM; speaker diarization and scene analysis, with no limit on the number of speakers; theme detection; image segmentation; ...

37 Wrap up Some models, or parts of models, cannot be described by a fixed set of parameters; this is where non-parametric methods are useful. A Gaussian process is a random function: the randomness on functions is defined as randomness on finite sets of variables, which follow a multivariate Gaussian. A Dirichlet process is another random function, namely a random distribution: its randomness is defined on any partition. Stick-breaking and the CRP are both good metaphors for the process. Non-parametric Bayes is a nice extension to the Bayesian framework, connecting random variables to random functions.

