
1 3(+1) classifiers from the Bayesian world
21/03/2017

2 Bayes classifier
Bayes decision theory: P(ωj | x) = P(x | ωj) · P(ωj) / P(x)
Discriminant function in the case of a normal (Gaussian) likelihood
Parameter estimation: the form of the density is known, the density is described by a few parameters, and those parameters are estimated from the data

3 Example
(flattened table: customer records with the attributes age (<21, 21–50, 50<), debit (none, …), income (<50K, 50K–200K, 200K<) and the class label leave? (yes / no); a new customer marked '?' has to be classified)

4 Naïve Bayes

5 Naïve Bayes
The Naive Bayes classifier is a Bayes classifier in which we assume the conditional independence of the features: P(x | ωj) = ∏i P(xi | ωj)

6 Naïve Bayes: two-category case
x = [x1, x2, …, xd]ᵗ, where each xi is binary, and pi = P(xi = 1 | ω1), qi = P(xi = 1 | ω2)
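For reference (this derivation is standard but not shown in the transcript), taking the log-ratio of the two class posteriors under the independence assumption yields a discriminant that is linear in the binary features:

```latex
g(\mathbf{x}) = \sum_{i=1}^{d} x_i \ln\frac{p_i(1-q_i)}{q_i(1-p_i)}
              + \sum_{i=1}^{d} \ln\frac{1-p_i}{1-q_i}
              + \ln\frac{P(\omega_1)}{P(\omega_2)},
\qquad \text{decide } \omega_1 \text{ if } g(\mathbf{x}) > 0.
```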


8 Training Naive Bayes - MLE
Goal: estimate pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2) from N training samples
assume that pi and qi are parameters of binomial distributions (counts of xi = 1 within each class)
the Maximum Likelihood estimate is the within-class relative frequency: p̂i = #{class-ω1 samples with xi = 1} / N1, and analogously for q̂i
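A minimal sketch of that MLE, assuming a 0/1 feature matrix X and 0/1 labels y (names are illustrative, not from the slides):

```python
import numpy as np

def naive_bayes_mle(X, y):
    """MLE of p_i = P(x_i = 1 | w1) and q_i = P(x_i = 1 | w2):
    the fraction of class samples in which feature i takes the value 1."""
    X1, X2 = X[y == 1], X[y == 0]
    p_hat = X1.mean(axis=0)   # counts of x_i = 1 in class w1, divided by N1
    q_hat = X2.mean(axis=0)   # counts of x_i = 1 in class w2, divided by N2
    return p_hat, q_hat
```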

9 Training Naive Bayes – Bayes estimation
Beta distribution: X ~ Beta(a, b), E[X] = a / (a + b) = 1 / (1 + b/a)

10 Training Naive Bayes – Bayes estimation
assume that pi and qi are binomial parameters
we use a Beta distribution to represent the uncertainty of the estimate
… 2 steps of Bayes estimation …

11 Training Naive Bayes – Bayes estimation (m-estimate)
in practice: avoiding a zero likelihood/posterior
p̂i = (ni + m·p) / (N + m), where ni is the number of class samples with xi = 1 and N is the class sample size
m and p are constants (metaparameters)
p is the prior guess for each pi
m is the "equivalent sample size"
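A small sketch of the m-estimate itself (m and p as above; the helper name is made up):

```python
def m_estimate(n_i, N, m, p):
    """Smoothed estimate of P(x_i = 1 | class): (count + m*p) / (N + m).
    n_i: number of class samples with x_i = 1, N: class sample size."""
    return (n_i + m * p) / (N + m)

# e.g. with no positive observations among N = 2 samples, m = 2, p = 0.5:
# m_estimate(0, 2, 2, 0.5) == 0.25 instead of the MLE's 0.0
```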

12 Naïve Bayes in practice
not so naive:
fast, easily distributable
low memory footprint
a good choice when there are many features and potentially every feature can contribute to the solution

13 Example
(same customer table as before, together with the class priors P(ω); the query customer has age > 50, no debit and income > 200K)
P(age>50 | ω = yes) = (0 + m·p) / (2 + m)
P(no debit | ω = yes) and P(income>200K | ω = yes) are estimated the same way
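Putting these factors together, the Naive Bayes decision for the query customer takes the form below, where n_{i,ω} denotes the count of the i-th observed feature value within class ω and N_ω the class size (a generic sketch, not a computation from the slides):

```latex
\hat{\omega} = \arg\max_{\omega \in \{\text{yes},\,\text{no}\}}
  \; P(\omega)\,\prod_{i=1}^{d} \frac{n_{i,\omega} + m\,p}{N_{\omega} + m}
```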

14 Generative vs. Discriminative Classifiers
Generative: modeling the data belonging to each class, i.e. how the data are generated; in the Bayes classifier the likelihood P(x | ωj) and the prior P(ωj) are estimated
Discriminative: the goal is the discrimination of the classes; in the Bayes classifier this means direct estimation of the posterior P(ωj | x)
(figure: graphical models over the features x1, x2, x3 for the two approaches)

15 Logistic Regression (Maximum Entropy Classifier)
Two-category case: P(ω1 | x) = 1 / (1 + exp(−wᵀx − b)), P(ω2 | x) = 1 − P(ω1 | x)
Training (MLE): maximize the conditional log-likelihood of the training labels; there is no closed-form solution, so gradient-based optimization is used
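A minimal gradient-ascent sketch of this two-category training, assuming a bias column has been appended to X and 0/1 labels y; the learning rate and epoch count are arbitrary example values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Maximize the conditional log-likelihood sum_n log P(y_n | x_n, w)
    by batch gradient ascent; y is 0/1, X already contains a bias column."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (y - sigmoid(X @ w))   # gradient of the log-likelihood
        w += lr * grad / len(y)
    return w

def predict_proba(w, X):
    """P(w1 | x) for each row of X."""
    return sigmoid(X @ w)
```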

16 Non-parametric Bayes classifiers

17 Non-parametric estimation of densities
Non-parametric estimation techniques do not assume the form of the density
In a Bayes classifier we can estimate non-parametrically
the likelihood P(x | ωj), i.e. a generative classifier, or
directly the posterior P(ωj | x), i.e. a discriminative classifier

18 Non-parametric estimation
estimate p(x)
the probability that a vector x falls in a region R is P = ∫R p(x′) dx′
P is a smoothed (averaged) version of the density function p(x)
if we have a sample of size n, the expected number of points falling in R is k = nP
Pattern Classification, Chapter 2 (Part 1)

19 Non-parametric estimation
applying MLE for P gives P ≅ k/n
if p(x) is continuous and the region R is so small that p does not vary significantly within it, then P = ∫R p(x′) dx′ ≅ p(x)·V, so p(x) ≅ (k/n) / V
where x is a point within R and V is the volume enclosed by R
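A tiny 1-D sketch of the estimate p(x) ≅ (k/n)/V: count the samples falling in a small interval R around x and divide by the interval length (the half-width is an arbitrary example value):

```python
import numpy as np

def density_estimate(x, samples, half_width=0.25):
    """p(x) ~= (k/n) / V with R = [x - half_width, x + half_width], V = 2*half_width."""
    k = np.sum(np.abs(samples - x) <= half_width)
    V = 2.0 * half_width
    return (k / len(samples)) / V
```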

20 Convergence of non-parametric estimations
The volume V needs to approach 0 if we want this estimate to converge to p(x)
In practice V cannot be made arbitrarily small, since the number of samples is always limited

21 Convergence of non-parametric estimations
Three conditions are necessary if we want pn(x) to converge to p(x):
lim n→∞ Vn = 0, lim n→∞ kn = ∞, lim n→∞ kn/n = 0

23 Parzen windows
the volume and the form of R are fixed, i.e. V is constant (for a given n)
p(x) is estimated by counting the points of the training sample that fall in R around x (= k)

24 Parzen windows- hypercube
R is a d-dimensional hypercube with edge length hn (volume Vn = hnᵈ)
φ((x − xi)/hn) is equal to unity if xi falls within the hypercube of volume Vn centered at x, and equal to zero otherwise (φ is called a kernel or window function)

25 Parzen windows- hypercube
the number of samples in this hypercube is kn = Σi=1..n φ((x − xi)/hn), which gives the estimate pn(x) = (1/n) Σi=1..n (1/Vn) φ((x − xi)/hn)

26 Generalized Parzen Windows
pn(x) estimates p(x) as an average of kernel values between x and the samples xi (i = 1, …, n)
φ can be any (kernel) function of x and xi

27 Parzen windows - example
p(x) ~ N(0,1), φ(u) = (1/√(2π)) exp(−u²/2) and hn = h1/√n (n > 1)
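A short sketch reproducing this setting (assumptions: 1-D data, Gaussian kernel, hn = h1/√n; h1 is a free metaparameter):

```python
import numpy as np

def parzen_estimate(x, samples, h1=1.0):
    """p_n(x) = (1/n) * sum_i (1/h_n) * phi((x - x_i)/h_n) with a Gaussian kernel."""
    n = len(samples)
    h_n = h1 / np.sqrt(n)
    u = (x - samples) / h_n
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return np.mean(phi / h_n)

# e.g. estimate p(0) from samples of N(0, 1):
# samples = np.random.randn(1000); parzen_estimate(0.0, samples)  # close to 0.399
```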


32 p(x) ?

33 real generator: p(x) = λ1·U(a,b) + λ2·T(c,d) (a mixture of a uniform and a triangular density)

34 Parzen windows as classifiers
Parzen windows are used to model/estimate the multidimensional likelihood
this gives a generative classifier
the decision surfaces/regions depend strongly on the kernel and on the kernel width
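A hedged sketch of such a generative Parzen-window classifier: estimate each class-conditional likelihood with a kernel density estimate and pick the class with the largest prior × likelihood (the Gaussian kernel and the shared width h are arbitrary choices here):

```python
import numpy as np

def parzen_likelihood(x, class_samples, h=0.5):
    """Kernel density estimate of P(x | class) with an isotropic Gaussian kernel."""
    d = class_samples.shape[1]
    u = (x - class_samples) / h
    phi = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return np.mean(phi / h**d)

def parzen_classify(x, X, y, h=0.5):
    """Decide the class maximizing P(class) * P(x | class)."""
    scores = {c: np.mean(y == c) * parzen_likelihood(x, X[y == c], h)
              for c in np.unique(y)}
    return max(scores, key=scores.get)
```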

36 Example
(same customer table, with the class priors P(ω))
P(age>50, no debit, income>200K | ω = yes) = ?
the kernel φ can be, for example, the number of features in which the values of x and xi differ

37 k nearest neighbor estimation
a solution for the problem of the unknown "best" window function: let the cell volume be a function of the training data
center a cell about x and let it grow until it captures kn samples (kn = f(n)); these are called the kn nearest neighbors of x
there are 2 possibilities:
the density is high near x: the cell will be small, so the resolution will be good
the density is low: the cell will grow large, and it stops growing when it reaches a region of high density
a family of estimates is obtained with the choice kn = k1·√n, for different choices of k1

38 © Ethem Alpaydin: Introduction to Machine Learning, 2nd edition (2010)

39 k nearest neighbor classifier (knn)
39 P(i | x) direct estimation form n training samples take the smallest R around x which includes k samples out of n if ki out of k is labeled by i : pn(x, i) = ki /(nV)

40 k nearest neighbor classifier (knn)
ki/k is the fraction of the samples within the cell that are labeled ωi
for minimum error rate, the most frequently represented category within the cell is selected
if k is large and the cell is sufficiently small, the performance approaches the best possible
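A minimal k-nearest-neighbor classifier sketch (Euclidean distance is used for the example; any metric works, e.g. the feature-mismatch count used on the next example slide):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, y, k=3):
    """Label x with the most frequent class among its k nearest training samples."""
    dists = np.linalg.norm(X - x, axis=1)   # distance of x to every training point
    nearest = np.argsort(dists)[:k]         # indices of the k closest samples
    return Counter(y[nearest]).most_common(1)[0][0]
```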

42 Example
(same customer table; the query customer marked '?' has to be classified)
k = 3
distance metric: the number of features in which two records differ

43 Non-parametric classifiers
They HAVE got parameters! Non-parametric classifiers are Bayes classifiers which use non-parametric density estimation approaches:
Parzen windows classifier: kernel φ and window width h (generative)
k nearest neighbor classifier: distance metric and k (discriminative)

44 About distance metrics

46 Summary: Bayes classifiers in practice
Generative (estimation of the likelihood): parametric → Naive Bayes; non-parametric → Parzen windows classifier
Discriminative (direct estimation of the posterior): parametric → Logistic Regression; non-parametric → k nearest neighbor classifier

