2  Parameter Estimation
– In previous chapters: we could design an optimal classifier if we knew the prior probabilities P(ωi) and the class-conditional probabilities P(x|ωi).
– Unfortunately, in pattern recognition applications we rarely have this kind of complete knowledge about the probabilistic structure of the problem.

3  Parameter Estimation
– We have a number of design samples or training data.
– The problem is to find some way to use this information to design or train the classifier.
– One approach:
  – Use the samples to estimate the unknown probabilities and probability densities,
  – and then use the resulting estimates as if they were the true values.

5  Parameter Estimation
– What are you going to estimate in your homework?

6  Parameter Estimation
– If we know the number of parameters in advance, and our general knowledge about the problem permits us to parameterize the conditional densities, then the severity of the problem can be reduced significantly.

7  Parameter Estimation
– For example:
  – We can reasonably assume that p(x|ωi) is a normal density with mean µi and covariance matrix Σi.
  – We do not know the exact values of these quantities.
  – However, this knowledge simplifies the problem from one of estimating an unknown function p(x|ωi) to one of estimating the parameters: the mean µi and the covariance matrix Σi.
  – Estimating p(x|ωi) → estimating µi and Σi (see the sketch after this slide).
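To make this plug-in idea concrete, here is a minimal sketch assuming NumPy and simulated 2-D data; the function name, sample size, and true parameters are illustrative choices, not part of the slides:

```python
# A minimal sketch (illustrative, not from the slides): estimate the mean and
# covariance of a Gaussian class-conditional density p(x|w_i) from training
# samples of class i, then use the estimates as if they were the true values.
import numpy as np

def estimate_gaussian(samples):
    """Return ML estimates (mean vector, covariance matrix) for a Gaussian."""
    samples = np.asarray(samples, dtype=float)   # shape (n_samples, d)
    mu_hat = samples.mean(axis=0)                # sample mean
    diff = samples - mu_hat
    sigma_hat = diff.T @ diff / len(samples)     # ML (biased) covariance estimate
    return mu_hat, sigma_hat

# Hypothetical 2-D training data for one class
rng = np.random.default_rng(0)
x_i = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 2.0]], size=500)
mu_hat, sigma_hat = estimate_gaussian(x_i)
print(mu_hat)      # close to [1.0, -2.0]
print(sigma_hat)   # close to the true covariance matrix
```

The covariance is divided by n rather than n − 1 because the maximum-likelihood estimate is the biased one discussed later in this section.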

8  Parameter Estimation
– Data availability in a Bayesian framework: we could design an optimal classifier if we knew
  – P(ωi) (priors)
  – p(x|ωi) (class-conditional densities)
  Unfortunately, we rarely have this complete information!
– Designing a classifier from a training sample:
  – there is no problem with prior estimation,
  – but samples are often too small for class-conditional estimation (large dimension of the feature space!).

9  Parameter Estimation
– Given a set of data from each class, how do we estimate the parameters of the class-conditional densities p(x|ωi)?
– Example: p(x|ωi) = N(µi, Σi) is normal, with parameters θi = (µi, Σi).

10
– Two major approaches: the Maximum-Likelihood method and the Bayesian method.
– Use P(ωi|x) for our classification rule!
– Results are nearly identical, but the approaches are different.

11  Maximum-Likelihood vs. Bayesian
– Maximum Likelihood:
  – Parameters are fixed but unknown!
  – The best parameters are obtained by maximizing the probability of obtaining the samples observed.
– Bayes:
  – Parameters are random variables having some known distribution.
  – The best parameters are obtained by estimating them given the data.

12  Major assumptions
– A priori information P(ωi) for each category is available.
– Samples are i.i.d. and p(x|ωi) is normal: p(x|ωi) ~ N(µi, Σi).
– Note: each class-conditional density is characterized by 2 parameters (µi and Σi).

13  Maximum-Likelihood Estimation
– Has good convergence properties as the sample size increases.
– Simpler than alternative techniques.

15  Maximum-Likelihood Estimation
– General principle: assume we have c classes, and that the samples in class j have been drawn according to the probability law p(x|ωj), with
  p(x|ωj) ~ N(µj, Σj),
  written p(x|ωj, θj), where θj = (µj, Σj).

16  Maximum-Likelihood Estimation
– Our problem is to use the information provided by the training samples to obtain good estimates for the unknown parameter vectors associated with each category.

17  Likelihood Function
– Use the information provided by the training samples to estimate θ = (θ1, θ2, …, θc), where each θi (i = 1, 2, …, c) is associated with one category.
– Suppose that the sample set D contains n samples x1, x2, …, xn; drawn independently, they give the likelihood P(D|θ) = ∏k p(xk|θ).

18
– Goal: find an estimate of θ, i.e., find the θ that maximizes P(D|θ).
– "It is the value of θ that best agrees with the actually observed training sample." (A numerical sketch follows.)
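As a small numerical illustration of "the value of θ that best agrees with the observed samples", the sketch below evaluates the log-likelihood on a grid of candidate parameters and picks the maximizer; the data, the names, and the choice of a 1-D Gaussian with known variance are assumptions for the example, not part of the slides:

```python
# A minimal sketch (illustrative): pick the theta that maximizes ln P(D | theta)
# for a 1-D Gaussian model N(theta, 1) by brute-force search over candidates.
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(loc=3.0, scale=1.0, size=200)   # observed training samples (true mean 3.0)
candidates = np.linspace(0.0, 6.0, 601)        # candidate values of theta (the unknown mean)

def log_likelihood(theta, data, sigma=1.0):
    """l(theta) = ln P(D|theta) = sum_k ln p(x_k|theta) for a N(theta, sigma^2) model."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                  - (data - theta) ** 2 / (2.0 * sigma**2))

scores = np.array([log_likelihood(t, D) for t in candidates])
theta_hat = candidates[np.argmax(scores)]
print(theta_hat, D.mean())   # the maximizer is (up to grid spacing) the sample mean
```

The gradient conditions on the next slides replace this brute-force search with a closed-form solution.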

20
– Maximize the log-likelihood function: l(θ) = ln P(D|θ).
– Let θ = (θ1, θ2, …, θp)^t and let ∇θ be the gradient operator with respect to θ.
– New problem statement: determine the θ that maximizes the log-likelihood.

21
– The set of necessary conditions for an optimum is ∇θ l = 0 (written out in the sketch below).
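For reference, here is the standard maximum-likelihood formulation that this slide summarizes, written in the notation defined above; the original slide shows the equations only as images, so treat this as the textbook form rather than a verbatim copy:

```latex
l(\theta) = \ln P(D \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta),
\qquad
\nabla_{\theta}\, l = \sum_{k=1}^{n} \nabla_{\theta} \ln p(x_k \mid \theta) = 0,
\qquad
\hat{\theta} = \arg\max_{\theta}\, l(\theta).
```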

22
– Example of a specific case: unknown µ.
– p(xk|µ) ~ N(µ, Σ): the samples are drawn from a multivariate normal population with known Σ, so θ = µ.
– Therefore, the ML estimate µ̂ must satisfy ∑k Σ⁻¹(xk − µ̂) = 0.

23
– Multiplying by Σ and rearranging, we obtain µ̂ = (1/n) ∑k xk: just the arithmetic average of the training samples!
– Conclusion: if p(xk|ωj) (j = 1, 2, …, c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the vector θ = (θ1, θ2, …, θc)^t and perform an optimal classification!

24
– ML estimation, Gaussian case: unknown µ and σ.
– θ = (θ1, θ2) = (µ, σ²).

25
– Setting the gradient of the log-likelihood to zero gives one equation for µ, (1), and one for σ², (2). Combining (1) and (2), one obtains µ̂ = (1/n) ∑k xk and σ̂² = (1/n) ∑k (xk − µ̂)², as checked numerically below.
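A minimal numerical check of these closed-form results; the data are simulated and all names are illustrative assumptions:

```python
# A minimal sketch (illustrative): ML estimates of a 1-D Gaussian with both
# mean and variance unknown, compared against NumPy's built-in estimators.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=-1.5, scale=2.0, size=1000)   # hypothetical training samples

mu_hat = x.sum() / len(x)                        # result of equation (1): sample mean
sigma2_hat = ((x - mu_hat) ** 2).sum() / len(x)  # result of equation (2): biased (ML) variance

print(mu_hat, sigma2_hat)
print(np.mean(x), np.var(x))   # np.var uses ddof=0 by default, i.e. the ML form
```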

26
– Example 1: consider an exponential distribution (single feature, single parameter),
  p(x|θ) = θ e^(−θx) for x ≥ 0, and 0 otherwise.
– With a random sample x1, …, xn, estimate θ.

27
– Example 1 (continued): maximizing the log-likelihood l(θ) = n ln θ − θ ∑k xk gives
  θ̂ = n / ∑k xk = 1 / x̄ (the inverse of the sample average). A sketch of this example follows.
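A minimal sketch of Example 1 with simulated data; the true rate and the sample size are illustrative assumptions:

```python
# A minimal sketch (illustrative): the ML estimate of the exponential rate
# parameter theta is the inverse of the sample average.
import numpy as np

rng = np.random.default_rng(3)
theta_true = 2.5
# NumPy's exponential generator is parameterized by the scale 1/theta.
x = rng.exponential(scale=1.0 / theta_true, size=2000)

theta_hat = 1.0 / x.mean()     # ML estimate: n / sum_k x_k
print(theta_true, theta_hat)   # theta_hat should be close to theta_true
```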

28
– Example 2: multivariate Gaussian with unknown mean vector M; assume the covariance Σ is known.
– We have k i.i.d. samples X1, …, Xk from the same distribution; maximizing the log-likelihood reduces to a linear-algebra computation.

29
– The result is M̂ = (1/k) ∑i Xi, the sample average or sample mean (derivation sketched below).
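The derivation behind this result, written out in the example's notation; since the original slides show these steps only as images, this is a sketch of the standard argument rather than a verbatim reproduction:

```latex
l(M) = \sum_{i=1}^{k} \ln p(X_i \mid M)
     = -\tfrac{1}{2} \sum_{i=1}^{k} (X_i - M)^{t}\, \Sigma^{-1} (X_i - M) + \text{const},
\qquad
\nabla_{M}\, l = \sum_{i=1}^{k} \Sigma^{-1} (X_i - M) = 0
\;\Longrightarrow\;
\hat{M} = \frac{1}{k} \sum_{i=1}^{k} X_i .
```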

30
– Example 3: binary variables x1, …, xn with unknown parameters θ1, …, θn (n parameters), where P(xi = 1) = θi.
– How do we estimate these parameters? Think about your homework.

31
– Example 3 (continued): with k samples, the ML estimate is θ̂i = (1/k) ∑j xij, where xij is the i-th element of the j-th sample.

32
– So, θ̂i is the sample average of the i-th feature.

33
– Since xi is binary, summing its values is the same as counting the occurrences of '1'.
– Consider a character recognition problem with binary matrices (the original slide illustrates two example characters, A and B): for each pixel, count the number of 1's and divide by the number of samples; this is the estimate of θi for that pixel. A sketch follows.
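A minimal sketch of the per-pixel counting estimate for binary character images; the image size, sample count, and simulated class are illustrative assumptions:

```python
# A minimal sketch (illustrative): ML estimation of per-pixel Bernoulli
# parameters theta_i for binary character images, where P(x_i = 1) = theta_i.
import numpy as np

rng = np.random.default_rng(4)
theta_true = rng.uniform(0.1, 0.9, size=(8, 8))                # hypothetical 8x8 pixel probabilities
samples = (rng.random((500, 8, 8)) < theta_true).astype(int)   # 500 binary images from one class

# For each pixel, count the 1's and divide by the number of samples k.
theta_hat = samples.mean(axis=0)
print(np.abs(theta_hat - theta_true).max())   # small when enough samples are available
```

In a real character recognizer, one such table of θ̂i values would be estimated separately for each class (e.g., one for 'A' and one for 'B').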

34  References
– R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification, New York: John Wiley, 2001.
– Selim Aksoy, Pattern Recognition Course Materials, 2011.
– M. Narasimha Murty, V. Susheela Devi, Pattern Recognition: An Algorithmic Approach, Springer, 2011.
– "Bayes Decision Rule Discrete Features", http://www.cim.mcgill.ca/~friggi/bayes/decision/binaryfeatures.html, accessed 2011.

