unit #3 Neural Networks and Pattern Recognition


1 unit #3 Neural Networks and Pattern Recognition
Giansalvo EXIN Cirrincione

2 PROBABILITY DENSITY ESTIMATION
Parametric methods (the training data may be labelled or unlabelled): a specific functional form for the density model is assumed. This contains a number of parameters which are then optimized by fitting the model to the training set. The drawback is that the chosen form may be incapable of providing a good representation of the true density.

3 PROBABILITY DENSITY ESTIMATION
Non-parametric estimation does not assume a particular functional form, but allows the form of the density to be determined entirely by the data. The drawback is that the number of parameters grows with the size of the training set (TS).

4 PROBABILITY DENSITY ESTIMATION
Semi-parametric estimation allows a very general class of functional forms in which the number of adaptive parameters can be increased in a systematic way to build ever more flexible models, but where the total number of parameters in the model can be varied independently of the size of the data set.

5 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution

6 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution Mahalanobis distance
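For reference, the multivariate normal density and the Mahalanobis distance it is built on can be written in standard notation (d is the dimensionality, μ the mean vector, Σ the covariance matrix; these symbols are not reproduced from the slide images):

p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu) \right\},
\qquad
\Delta^2 = (x - \mu)^{T} \Sigma^{-1} (x - \mu).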

7 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution. Contour of constant probability density, on which the density is smaller than its value at the mean by a factor exp(-1/2).

8 The components of x are statistically independent
Parametric model: normal or Gaussian distribution The components of x are statistically independent

9 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution

10 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution. Some properties: any moment can be expressed as a function of μ and Σ; under general assumptions, the mean of M random variables tends to be distributed normally in the limit as M tends to infinity (central limit theorem), an example being the sum of a set of variables drawn independently from the same distribution; under any non-singular linear transformation of the coordinate system, the pdf is again normal, but with different parameters; the marginal and conditional densities are normal.

11 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution. Discriminant functions for independent normal class-conditional pdf's: in general the decision boundary is quadratic.

12 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution. For independent normal class-conditional pdf's with equal covariance matrices (Σ_k = Σ), the decision boundary becomes linear.
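A hedged reconstruction of the discriminant functions these two slides refer to, using y_k for the discriminant of class C_k:

y_k(x) = \ln p(x \mid C_k) + \ln P(C_k)
       = -\frac{1}{2} (x - \mu_k)^{T} \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2} \ln |\Sigma_k| + \ln P(C_k) + \text{const}.

In general this is quadratic in x; when Σ_k = Σ for all classes, the quadratic terms cancel in the differences y_k(x) - y_j(x) and the decision boundaries become linear.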

13 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution P(C1) = P(C2)

14 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution P(C1) = P(C2) = P(C3)

15 normal or Gaussian distribution
Parametric model: normal or Gaussian distribution. With Σ = σ²I, the classifier reduces to template matching: a new point is assigned to the class whose mean vector (template) is nearest.

ML finds the optimum values for the parameters by maximizing a likelihood function derived from the training data, which is assumed to be drawn independently from the required distribution.

ML finds the optimum values for the parameters by maximizing a likelihood function derived from the training data. The joint probability density of the TS, viewed as a function of the parameters θ, is the likelihood of θ for the given TS.
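In standard notation (θ denotes the parameter vector and x_1, ..., x_N the training points; this is the usual definition, not transcribed from the slide):

L(\theta) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta),
\qquad
E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta),

so maximizing the likelihood is equivalent to minimizing the error function E.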

18 Error function: the negative log-likelihood. For a Gaussian pdf, the ML estimates of the parameters are the corresponding sample averages (homework).
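The "sample averages" are the standard ML results for the Gaussian case, stated here for completeness:

\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n,
\qquad
\hat{\Sigma} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})(x_n - \hat{\mu})^{T}.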

19 Uncertainty in the values of the parameters

20

21 weighting factor (posterior distribution)
Bayes' theorem for the parameters: the likelihood of the data, drawn independently from the underlying distribution, acts as a weighting factor on the prior to give the posterior distribution.
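In symbols, Bayes' theorem applied to the parameters θ reads (standard form, with the product reflecting the independence assumption):

p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}
                 = \frac{p(\theta) \prod_{n=1}^{N} p(x_n \mid \theta)}{\int p(\theta') \prod_{n=1}^{N} p(x_n \mid \theta')\, d\theta'},

so the likelihood converts the prior p(θ) into the posterior p(θ | X).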

22

23

24 For large numbers of observations, the Bayesian representation of the density approaches the maximum likelihood solution. A prior which gives rise to a posterior having the same functional form is said to be a conjugate prior (reproducing densities, e.g. Gaussian).

25 Example (homework): assume σ known, find μ given the data set
With a normal prior distribution over μ, the posterior is again a normal distribution, whose mean involves the sample mean.
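A sketch of the standard result for this example, assuming a Gaussian prior N(μ_0, σ_0²) over μ and known variance σ² (μ_0 and σ_0 are my own labels for the prior parameters):

\mu_N = \frac{N \sigma_0^2}{N \sigma_0^2 + \sigma^2}\, \bar{x} + \frac{\sigma^2}{N \sigma_0^2 + \sigma^2}\, \mu_0,
\qquad
\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2},

where x̄ is the sample mean; as N grows, the posterior mean approaches the sample mean and the posterior variance shrinks to zero.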

26 Example normal distribution

27 Iterative techniques: batch vs. sequential, no storage of a complete TS
Sequential methods require no storage of a complete TS and allow on-line learning in real-time adaptive systems and tracking of slowly varying systems. A sequential update can be obtained from the ML estimate of the mean of a normal distribution, as shown below.
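The sequential form referred to here follows by separating off the contribution of the latest data point from the ML sample mean:

\hat{\mu}_N = \hat{\mu}_{N-1} + \frac{1}{N}\, (x_N - \hat{\mu}_{N-1}),

so only the previous estimate and the current data point need to be stored.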

28 The Robbins-Monro algorithm
Consider a pair of correlated random variables g and θ, related through the regression function f(θ) ≡ E[g | θ]. Assume g has finite variance.

29 The Robbins-Monro algorithm
The coefficients of the successive corrections must satisfy three conditions: they are positive and decrease in magnitude, so that the process converges; the corrections are sufficiently large that the root is found; and the accumulated noise has finite variance, so that noise does not spoil convergence.
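In symbols, these conditions constrain the sequence of positive coefficients a_N in the root-finding update (standard statement of the Robbins-Monro result):

\theta_{N+1} = \theta_{N} + a_N\, g(\theta_N),
\qquad
\lim_{N \to \infty} a_N = 0,
\qquad
\sum_{N=1}^{\infty} a_N = \infty,
\qquad
\sum_{N=1}^{\infty} a_N^2 < \infty.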

30 The Robbins-Monro algorithm
The ML parameter estimate θ can be formulated as a sequential update method using the Robbins-Monro formula.
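A hedged sketch of the resulting update: seeking the root of the expected derivative of the log-likelihood with the Robbins-Monro formula gives

\theta^{(N+1)} = \theta^{(N)} + a_N \left. \frac{\partial}{\partial \theta} \ln p(x_{N+1} \mid \theta) \right|_{\theta^{(N)}}.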

31 homework

Consider the case where the pdf is taken to be a normal distribution, with known standard deviation σ and unknown mean μ. Show that, by choosing a_N = σ² / (N+1), the one-dimensional iterative version of the ML estimate of the mean is recovered by using the Robbins-Monro formula for sequential ML. Obtain the corresponding formula for the iterative estimate of σ² and repeat the same analysis.

33 SUPERVISED LEARNING: histograms. We can choose both the number of bins M and their starting position on the axis. The number of bins (equivalently, the bin width) acts as a smoothing parameter. Curse of dimensionality: a d-dimensional input space requires M^d bins.

34 Density estimation in general
The probability P that a new vector x, drawn from the unknown pdf p(x), will fall inside some region R of x-space is given by the integral of p(x) over R. If we have N points drawn independently from p(x), the probability that K of them will fall within R is given by the binomial law. The distribution is sharply peaked as N tends to infinity. Assume p(x) is continuous and varies only slightly over the region R, of volume V.
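Written out explicitly (standard derivation; the symbols follow the text above):

P = \int_R p(x')\, dx',
\qquad
\Pr(K) = \binom{N}{K} P^{K} (1 - P)^{N - K},
\qquad
E[K] = NP.

Combining K ≈ NP with P ≈ p(x) V for a small region of volume V gives the basic density estimate p(x) ≈ K / (NV).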

35 Density estimation in general
Trade-off. Assumption #1: R relatively large, so that P will be large and the binomial distribution will be sharply peaked. Assumption #2: R small, which justifies the assumption of p(x) being nearly constant inside the integration region. Fixing K and determining V from the data gives the K-nearest-neighbours approach.

36 Density estimation in general
Trade-off. Assumption #1: R relatively large, so that P will be large and the binomial distribution will be sharply peaked. Assumption #2: R small, which justifies the assumption of p(x) being nearly constant inside the integration region. Fixing V and determining K from the data gives the kernel-based methods.

37 interpolation function (ZOH)
Kernel-based methods: R is a hypercube centred on x. We can find an expression for K by defining a kernel function H(u), also known as a Parzen window, which acts as a zero-order-hold (ZOH) interpolation function. The resulting estimate is a superposition of N cubes of side h, with each cube centred on one of the data points.
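The hypercube (ZOH) kernel and the corresponding estimate have the standard form (h is the side of the cube, d the dimensionality):

H(u) = \begin{cases} 1 & |u_i| \le 1/2, \; i = 1, \dots, d \\ 0 & \text{otherwise} \end{cases},
\qquad
\tilde{p}(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^d}\, H\!\left( \frac{x - x_n}{h} \right).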

38 Kernel-based methods: replacing the hypercube with a smooth kernel function (e.g. a Gaussian) gives a smoother estimate.

39 Kernel-based methods: example with 30 samples, comparing the ZOH (hypercube) kernel with a Gaussian kernel.

40 Over different selections of data points xn
Kernel-based methods. Averaging over different selections of the data points x_n, the expectation of the estimated density is a convolution of the true pdf with the kernel function, and so represents a smoothed version of the pdf. All of the data points must be stored! For a finite data set, there is no non-negative estimator which is unbiased for all continuous pdf's (Rosenblatt, 1956).
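A minimal sketch of such a kernel estimator (Gaussian kernel; the helper name parzen_estimate and the array shapes are my own assumptions, not from the slides):

import numpy as np

def parzen_estimate(x, data, h):
    """Gaussian-kernel (Parzen window) density estimate at point x.

    x    : (d,) query point
    data : (N, d) stored training points -- all of them must be kept
    h    : kernel width (smoothing parameter)
    """
    N, d = data.shape
    diff = data - x                                   # (N, d) differences to every stored point
    sq_dist = np.sum(diff ** 2, axis=1)               # squared Euclidean distances
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)        # Gaussian normalisation constant
    return np.mean(np.exp(-sq_dist / (2.0 * h ** 2)) / norm)

# toy usage: 30 one-dimensional samples, as in the 30-sample figure above
rng = np.random.default_rng(0)
samples = rng.normal(size=(30, 1))
print(parzen_estimate(np.array([0.0]), samples, h=0.3))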

41 K-nearest neighbours. One of the potential problems with the kernel-based approach arises from the use of a fixed width parameter h for all of the data points: if h is too large, there may be regions of x-space in which the estimate is oversmoothed, while reducing h may lead to noisy estimates in regions of lower density. The optimum choice of h may therefore be a function of position. In the K-nearest-neighbours approach, consider a small hypersphere centred at a point x and allow the radius of the sphere to grow until it contains precisely K data points. The estimate of the density is then given by K / (NV), where V is the final volume of the sphere.

42 K-nearest neighbours. The estimate is not a true probability density, since its integral over all x-space diverges. All of the data points must be stored! (The search for the nearest neighbours can be made more efficient with branch-and-bound techniques.)

The data set contains N_k points in class C_k and N points in total.
K-nearest neighbour classification rule. The data set contains N_k points in class C_k and N points in total. Draw a hypersphere around x which encompasses K points, irrespective of their class.

44 K-nearest neighbour classification rule
K = 1 : nearest-neighbour rule Find a hypersphere around x which contains K points and then assign x to the class having the majority inside the hypersphere.
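A minimal sketch of this classification rule (brute-force search; the name knn_classify and the NumPy-based implementation are illustrative assumptions, not from the slides):

import numpy as np
from collections import Counter

def knn_classify(x, data, labels, K=3):
    """Assign x to the class with the majority among its K nearest training points."""
    dists = np.linalg.norm(data - x, axis=1)      # distance from x to every stored point
    nearest = np.argsort(dists)[:K]               # indices of the K closest points
    votes = Counter(labels[i] for i in nearest)   # class membership inside the 'hypersphere'
    return votes.most_common(1)[0][0]

With K = 1 this reduces to the nearest-neighbour rule of the following slides.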

45 K-nearest neighbour classification rule
K = 1 : nearest-neighbour rule Samples that are close in feature space likely belong to the same class.

46 K-nearest neighbour classification rule
1-NNR

47 Measure of the distance between two density functions
L ≥ 0, with equality iff the two pdf's are equal. This measure of the distance between two density functions is the Kullback-Leibler distance, or asymmetric divergence.
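In the usual notation, with p the true density and p̃ the model density:

L = -\int p(x) \ln \frac{\tilde{p}(x)}{p(x)}\, dx \;\ge\; 0,

with equality if and only if the two densities are equal.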

48 homework

49

50

51

52 Semi-parametric methods: techniques not restricted to specific functional forms, where the size of the model grows only with the complexity of the problem being solved, and not simply with the size of the data set. They are computationally intensive. MIXTURE MODEL. Training methods based on ML: nonlinear optimization; re-estimation (EM algorithm); stochastic sequential estimation.

53 MIXTURE DISTRIBUTION: the mixing parameters P(j) can be interpreted as the prior probability of a data point having been generated from component j of the mixture.
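In symbols (standard mixture notation, with M components):

p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j),
\qquad
\sum_{j=1}^{M} P(j) = 1,
\qquad
0 \le P(j) \le 1.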

54 Incomplete data (no component label). To generate a data point from the pdf, one of the components j is first selected at random with probability P(j), and then a data point is generated from the corresponding component density p(x|j). The mixture can approximate any CONTINUOUS density to arbitrary accuracy, provided the model has a sufficiently large number of components and the parameters of the model are chosen correctly.

55 posterior probability
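By Bayes' theorem, the posterior probability that a particular data point x was generated by component j is

P(j \mid x) = \frac{p(x \mid j)\, P(j)}{p(x)} = \frac{p(x \mid j)\, P(j)}{\sum_{j'} p(x \mid j')\, P(j')},
\qquad
\sum_{j=1}^{M} P(j \mid x) = 1.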

56 Spherical Gaussian components, with mean vectors μ_j = (μ_j1, ..., μ_jd), j = 1, ..., M.

57 One of the Gaussian components collapses onto one of the data points
MAXIMUM LIKELIHOOD. Adjustable parameters: P(j), μ_j and σ_j for j = 1, ..., M. Problems: singular solutions, in which one of the Gaussian components collapses onto one of the data points and the likelihood goes to infinity; local minima.

58 MAXIMUM LIKELIHOOD. Problems: singular solutions (likelihood goes to infinity); local minima. Possible solutions: constrain the components to have equal variance, or impose a minimum (underflow) threshold on the variance.

59 softmax or normalized exponential
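The softmax (normalized exponential) representation writes the mixing coefficients in terms of unconstrained auxiliary variables γ_j, which guarantees that they stay positive and sum to one (γ_j is my own label for the auxiliary variables):

P(j) = \frac{\exp(\gamma_j)}{\sum_{k=1}^{M} \exp(\gamma_k)}.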

60 Expressions for the parameters at a minimum of E
Mean of the data vectors, weighted by the posterior probabilities that the corresponding data points were generated from that component.

61 Expressions for the parameters at a minimum of E
Variance of the data with respect to the mean of that component, again weighted by the posterior probabilities.

62 Expressions for the parameters at a minimum of E
Posterior probabilities for that component, averaged over the data set.
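Collecting the three conditions of slides 60-62 in their standard form for spherical Gaussian components in d dimensions (not transcribed from the slide images):

\hat{\mu}_j = \frac{\sum_n P(j \mid x_n)\, x_n}{\sum_n P(j \mid x_n)},
\qquad
\hat{\sigma}_j^2 = \frac{1}{d} \, \frac{\sum_n P(j \mid x_n)\, \| x_n - \hat{\mu}_j \|^2}{\sum_n P(j \mid x_n)},
\qquad
\hat{P}(j) = \frac{1}{N} \sum_n P(j \mid x_n).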

63 non-linear coupled equations
These expressions for the parameters at a minimum of E form a set of highly non-linear coupled equations.

64 Expectation-maximization (EM) algorithm
Expectation-maximization (EM) algorithm: given the "old" parameter values, compute "new" values that reduce the error function; the error decreases at each iteration until a local minimum is found.
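A minimal sketch of how the re-estimation formulas above turn into an iterative algorithm (spherical Gaussian components; the function name em_spherical_gmm, the initialization and the fixed number of cycles are my own choices, not taken from the slides):

import numpy as np

def em_spherical_gmm(X, M, n_iter=20, seed=0):
    """EM for a mixture of M spherical Gaussians; X has shape (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, M, replace=False)]          # initial means: random data points
    var = np.full(M, X.var())                        # initial (shared) variances
    prior = np.full(M, 1.0 / M)                      # initial mixing coefficients

    for _ in range(n_iter):
        # E-step: posterior probabilities P(j | x_n), shape (N, M)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        log_p = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var) + np.log(prior)
        log_p -= log_p.max(axis=1, keepdims=True)    # subtract row maximum for numerical stability
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the posterior-weighted data
        Nj = post.sum(axis=0)                        # effective number of points per component
        mu = (post.T @ X) / Nj[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        var = np.maximum((post * sq).sum(axis=0) / (d * Nj), 1e-6)  # variance floor (cf. slide 58)
        prior = Nj / N
    return mu, var, prior

For instance, mu, var, prior = em_spherical_gmm(X, M=7, n_iter=20) would mimic the seven-component example on the next slide.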

65 proof Jensen’s inequality
Proof. Given a set of non-negative numbers λ_j that sum to one, Jensen's inequality states:
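In symbols, for non-negative λ_j with Σ_j λ_j = 1 and positive x_j:

\ln\!\left( \sum_j \lambda_j x_j \right) \;\ge\; \sum_j \lambda_j \ln x_j.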

66 Minimizing Q leads to a decrease in the value of E^new, unless E^new is already at a local minimum.

67 Gaussian mixture model
Minimize Q with respect to the new parameters. End of proof.

68 Example: EM algorithm, 1000 data points, uniform distribution
Example: the EM algorithm applied to 1000 data points drawn from a uniform distribution, using a mixture with seven components; contours of constant probability density are shown after 20 cycles.

69 Why expectation-maximization ?
Hypothetical complete data set: for each data point x_n, introduce z_n, an integer in the range (1, M), specifying which component of the mixture generated x_n. The distribution of the z_n is unknown.

70 Why expectation-maximization ?
First we guess some values for the parameters of the mixture model (the old parameter values) and then we use these, together with Bayes’ theorem, to find the probability distribution of the {zn}. We then compute the expectation of Ecomp w.r.t. this distribution. This is the E-step of the EM algorithm. The new parameter values are then found by minimizing this expected error w.r.t. the parameters. This is the maximization or M-step of the EM algorithm (min E = ML).

71 Why expectation-maximization ?
P^old(z_n|x_n) is the probability distribution for z_n, given the value of x_n and the old parameter values. Thus, the expectation of E^comp over the complete set of {z_n} values is given by the sum over all possible {z_n}, weighted by this distribution.
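A hedged sketch of the quantities involved, writing the complete-data error for the "new" parameter values and its expectation under the "old" posteriors:

E^{\text{comp}} = -\sum_{n=1}^{N} \ln \left[ P^{\text{new}}(z_n)\, p^{\text{new}}(x_n \mid z_n) \right],
\qquad
E\!\left[ E^{\text{comp}} \right] = -\sum_{n=1}^{N} \sum_{z_n = 1}^{M} P^{\text{old}}(z_n \mid x_n)\, \ln \left[ P^{\text{new}}(z_n)\, p^{\text{new}}(x_n \mid z_n) \right].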

72 homework Why expectation-maximization ?
The derivation of this expectation continues from the previous slide (homework).

73 Why expectation-maximization ? (equal to Q̃)
Taking the expectation of E^comp with respect to P^old(z_n|x_n), the resulting quantity is equal to Q̃.

74 It requires the storage of all previous data points
Stochastic estimation of parameters: this form of the update requires the storage of all previous data points.

75 no singular solutions in on-line problems
Stochastic estimation of parameters: no singular solutions arise in on-line problems.

76 single-layer networks
Next unit : single-layer networks

77 END

