3(+1) classifiers from the Bayesian world

3(+1) classifiers from the Bayesian world 21/03/2017

Bayes classifier
Bayes decision theory: P(ωj | x) = p(x | ωj) · P(ωj) / P(x)
Discriminant function in the case of a normal (Gaussian) likelihood
Parameter estimation: the form of the density is known, the density is defined by a few parameters, and we estimate those parameters from data
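For reference, a sketch of the discriminant the slide alludes to, assuming a multivariate normal likelihood p(x | ωj) = N(μj, Σj) (the explicit formula is not in the transcript):

```latex
g_j(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_j) + \ln P(\omega_j)
               = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_j)^{\top}\boldsymbol{\Sigma}_j^{-1}(\mathbf{x}-\boldsymbol{\mu}_j)
                 - \tfrac{1}{2}\ln\lvert\boldsymbol{\Sigma}_j\rvert - \tfrac{d}{2}\ln 2\pi + \ln P(\omega_j)
```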

Example
A toy customer table with attributes age (<21, 21–50, >50), debit and income (values such as none, <50K, 50K–200K, >200K), and the class label leave? (yes / no). The task is to predict leave? for a new customer marked "?".

Naïve Bayes

Naïve Bayes
The Naive Bayes classifier is a Bayes classifier in which we assume conditional independence of the features given the class:
P(x | ωj) = P(x1 | ωj) · P(x2 | ωj) · … · P(xd | ωj)

Naïve Bayes: two-category case
x = [x1, x2, …, xd]^t, where each xi is binary and
pi = P(xi = 1 | ω1), qi = P(xi = 1 | ω2)
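A standard result for this binary-feature, two-category setting (from Duda, Hart & Stork; not reproduced in the transcript): the Naive Bayes decision reduces to a linear discriminant, decide ω1 if g(x) > 0, where

```latex
g(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i + w_0, \qquad
w_i = \ln\frac{p_i(1-q_i)}{q_i(1-p_i)}, \qquad
w_0 = \sum_{i=1}^{d}\ln\frac{1-p_i}{1-q_i} + \ln\frac{P(\omega_1)}{P(\omega_2)}
```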

Training Naive Bayes: MLE
Goal: estimate pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2) from N training samples.
Assume the counts of xi = 1 follow a binomial distribution; the maximum-likelihood estimate is the empirical fraction, e.g. p̂i = (number of class-ω1 samples with xi = 1) / (number of class-ω1 samples).

Training Naive Bayes: Bayes estimation
Beta distribution: X ~ Beta(a, b), with E[X] = a / (a + b) = 1 / (1 + b/a)

Training Naive Bayes: Bayes estimation
Assume the counts of xi = 1 are binomially distributed and use a Beta distribution (the conjugate prior) to represent the uncertainty of the estimate; the two steps of Bayesian estimation follow (compute the posterior over pi, then take its expectation).

Training Naive Bayes: Bayes estimation (m-estimate)
In practice: avoids zero likelihood/posterior estimates.
p̂i = (number of class-ω1 samples with xi = 1 + m·p) / (number of class-ω1 samples + m)
m and p are constants (metaparameters): p is the prior guess for each pi, m is the "equivalent sample size".
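A minimal sketch of how this training could look in code, assuming binary feature vectors; the function and variable names are illustrative, and with m = 0 the estimates fall back to the plain MLE counts:

```python
import numpy as np

def train_naive_bayes(X, y, m=2.0, p=0.5):
    """Estimate per-feature likelihoods P(x_i = 1 | class) with the m-estimate.

    X : (N, d) array of 0/1 features, y : (N,) array of 0/1 class labels.
    With m = 0 the estimates reduce to plain maximum likelihood.
    """
    params = {}
    for c in (0, 1):
        Xc = X[y == c]                      # samples belonging to class c
        n_c = len(Xc)
        # m-estimate: (count of x_i = 1 in class c + m*p) / (n_c + m)
        params[c] = (Xc.sum(axis=0) + m * p) / (n_c + m)
    prior1 = y.mean()                       # P(class = 1)
    return params, prior1

def predict(x, params, prior1):
    """Return the more probable class for a binary vector x."""
    log_odds = np.log(prior1) - np.log(1 - prior1)
    log_odds += np.sum(x * np.log(params[1] / params[0]))
    log_odds += np.sum((1 - x) * np.log((1 - params[1]) / (1 - params[0])))
    return int(log_odds > 0)
```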

Naïve Bayes in practice
Not so naive: fast, easily distributable, low memory footprint; a good choice when there are many features and potentially each feature can contribute to the solution.

Example ? P() age debit income leave? <21 none < 50K yes 21-50 50K-200K 50< no 200K< ? P(age>50|  =yes) = (0+mp) / 2+m P(none debit|  =yes) P(200K<income|  =yes)

Generative vs. Discriminative Classifiers
Generative: model the data belonging to each class, i.e. how they are generated; in the Bayes classifier the likelihood P(x | ωj) and the prior P(ωj) are estimated.
Discriminative: the goal is the discrimination of the classes; in the Bayes setting this means direct estimation of the posterior P(ωj | x).
(figure: graphical models with class node ω and feature nodes x1, x2, x3 for the two views)

Logistic Regression (Maximum Entropy Classifier)
Two-category case: the posterior is modeled directly as P(ω1 | x) = 1 / (1 + exp(−(w·x + w0))), with P(ω2 | x) = 1 − P(ω1 | x).
Training (MLE): choose w and w0 to maximize the log-likelihood of the training labels.
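A minimal sketch of this training loop, assuming plain batch gradient ascent on the log-likelihood (the slide's own derivation is not in the transcript; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Two-category logistic regression trained by maximizing the log-likelihood.

    X : (N, d) feature matrix, y : (N,) labels in {0, 1}.
    Returns weights w (d,) and bias w0.
    """
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + w0)        # P(omega_1 | x) for every sample
        grad_w = X.T @ (y - p) / N     # gradient of the mean log-likelihood
        grad_w0 = np.mean(y - p)
        w += lr * grad_w
        w0 += lr * grad_w0
    return w, w0
```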

Non-parametric Bayes classifiers

Non-parametric estimation of densities
Non-parametric estimation techniques do not assume the functional form of the density.
In a Bayes classifier we can estimate non-parametrically either the likelihood P(x | ωj), i.e. a generative approach, or directly the posterior P(ωj | x), i.e. a discriminative approach.

Non-parametric estimation
Goal: estimate p(x). The probability that a vector x falls in a region R is P = ∫R p(x′) dx′.
P is a smoothed (averaged) version of the density p(x). If we have a sample of size n, the expected number of points falling in R is k = nP.
(Pattern Classification, Chapter 2, Part 1)

Non-parametric estimation
Applying MLE for P gives P̂ = k/n. If p(x) is continuous and the region R is so small that p does not vary significantly within it, then ∫R p(x′) dx′ ≈ p(x)·V, and therefore p(x) ≈ (k/n) / V, where x is a point within R and V is the volume enclosed by R.

Convergence of non-parametric estimations
The volume V needs to approach 0 if we want this estimate to converge to the true density; in practice, however, V cannot be made arbitrarily small because the number of samples is always limited.

Convergence of non-parametric estimations
Three necessary conditions should hold if we want pn(x) to converge to p(x):
lim n→∞ Vn = 0, lim n→∞ kn = ∞, lim n→∞ kn/n = 0


Parzen windows
The volume and the form of the region R are fixed, so V is constant (for a given n); p(x) is estimated from the count k of training-sample points falling in R around x.

Parzen windows: hypercube
R is a d-dimensional hypercube with edge length hn, so Vn = hn^d.
φ((x − xi)/hn) equals one if xi falls within the hypercube of volume Vn centered at x, and zero otherwise (φ is called a kernel or window function).

Parzen windows: hypercube
The number of samples in this hypercube is kn = Σi=1..n φ((x − xi)/hn), which gives the estimate pn(x) = (1/n) Σi=1..n (1/Vn) φ((x − xi)/hn).

Generalized Parzen windows
pn(x) estimates p(x) as an average of functions of the distance between x and the samples xi (i = 1, …, n); the kernel φ can be any suitable function of x and xi.

Parzen windows: example
True density p(x) ~ N(0,1); Gaussian kernel φ(u) = (1/√(2π)) exp(−u²/2) and window width hn = h1/√n (n > 1).
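A minimal sketch of this estimator in code, assuming a 1-D Gaussian kernel and the hn = h1/√n schedule from the slide (function and variable names are illustrative):

```python
import numpy as np

def parzen_estimate(x_grid, samples, h1=1.0):
    """Parzen-window density estimate with a Gaussian kernel.

    x_grid  : points at which to evaluate the estimate
    samples : 1-D training sample drawn from the unknown density
    h1      : base window width; the effective width is h_n = h1 / sqrt(n)
    """
    n = len(samples)
    hn = h1 / np.sqrt(n)
    # kernel values phi((x - x_i) / h_n) for every (grid point, sample) pair
    u = (x_grid[:, None] - samples[None, :]) / hn
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    # p_n(x) = (1/n) * sum_i (1/h_n) * phi((x - x_i) / h_n)
    return phi.sum(axis=1) / (n * hn)

# Example: estimate a standard normal density from 100 samples
rng = np.random.default_rng(0)
samples = rng.standard_normal(100)
grid = np.linspace(-4, 4, 200)
p_hat = parzen_estimate(grid, samples, h1=1.0)
```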

(figure slides: Parzen-window estimates of the example density for increasing n; a further figure shows samples from an unknown density, p(x) = ?)

The real generator: p(x) is a mixture (with weights λ1, λ2) of a uniform density U(a, b) and a triangular density T(c, d).

Parzen windows as classifiers
Parzen windows are used to model/estimate the multidimensional likelihood, giving a generative classifier. The decision surfaces/regions depend strongly on the kernel and the window width.


Example ? P() age debit income leave? <21 none < 50K yes 21-50 50K-200K 50< no 200K< ? P(age>50, none debit, 200K<income |  =yes) = ? can be the number of feature where tha values of x and xi are different

k nearest neighbor estimation
A solution to the problem of the unknown "best" window function: let the cell volume be a function of the training data. Center a cell about x and let it grow until it captures kn samples (kn = f(n)); these are the kn nearest neighbors of x.
There are two possibilities: the density near x is high, in which case the cell stays small and the resolution is good; or the density is low, in which case the cell grows large and stops only when it reaches a region of high density. A family of estimates is obtained with the choice kn = k1·√n, for different values of k1.

(figure © Ethem Alpaydın: Introduction to Machine Learning, 2nd edition, 2010)

k nearest neighbor classifier (kNN)
Direct estimation of P(ωi | x) from n training samples: take the smallest region R around x which includes k of the n samples. If ki of these k samples are labeled ωi, then pn(x, ωi) = ki / (nV) and the posterior estimate is Pn(ωi | x) = ki / k.

k nearest neighbor classifier (kNN)
ki / k is the fraction of the samples within the cell that are labeled ωi. For minimum error rate, the most frequently represented category within the cell is selected. If k is large and the cell is sufficiently small, the performance will approach the best possible.
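A minimal sketch of such a classifier, assuming a Hamming-style distance (the number of differing features, matching the distance used in the example that follows) and majority voting; names are illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training samples.

    Distance: number of features in which two samples differ (Hamming-style),
    suitable for categorical attributes such as age / debit / income bins.
    """
    dists = np.sum(X_train != x, axis=1)          # differing-feature counts
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]             # most frequent label wins

# Example with categorical features encoded as small integers
X_train = np.array([[0, 0, 0], [1, 1, 2], [2, 0, 2], [1, 2, 1]])
y_train = np.array(["yes", "no", "no", "yes"])
print(knn_predict(np.array([2, 0, 2]), X_train, y_train, k=3))
```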


Example
Same toy table: classify the query customer with kNN, k = 3, using as distance metric the number of features in which two samples differ.

Non-parametric classifiers
They HAVE got parameters! Non-parametric classifiers are Bayes classifiers which use non-parametric density estimation approaches:
Parzen-windows classifier: kernel and window width h (generative)
k nearest neighbor classifier: distance metric and k (discriminative)

About distance metrics

Summary: Bayes classifiers in practice
Generative (estimation of the likelihood): Naive Bayes (parametric), Parzen windows classifier (non-parametric)
Discriminative (direct estimation of the posterior): Logistic Regression (parametric), k nearest neighbor classifier (non-parametric)