Computational Intelligence: Methods and Applications

Lecture 13: Bayesian risks & Naive Bayes approach
Włodzisław Duch, Dept. of Informatics, UMK
Google: W Duch

Bayesian future
The February 2004 issue of MIT Technology Review showcased Bayesian machine learning as one of the 10 emerging technologies that will change your world!
Intel wants computers to be more proactive, learning from their experiences with users and the world around them. The Intel open-source machine learning library OpenML and the Audio-Visual Speech Recognition project are now part of the Open Source Computer Vision Library (OpenCV):
http://sourceforge.net/projects/opencvlibrary
http://opencvlibrary.sourceforge.net/

Total risks
Confusion matrix P, with elements P_kl giving the probability that the procedure assigns class ω_k when the true class is ω_l. Total risk of the decision procedure Ĉ:
R(Ĉ) = Σ_k Σ_l λ_kl P_kl,
where λ_kl is the cost of such a decision. If all P_kl = 0 for k ≠ l, the risk is zero (assuming λ_kk = 0). Minimizing the risk means that we try to find the decision procedure Ĉ that minimizes the "important" errors. For example, if the classes ω_k, k = 1, ..., 10, are degrees of severity of heart problems, then confusing ω_1 with ω_2 should cost less than confusing ω_1 with ω_9: λ_12 < λ_19.
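A minimal numerical sketch of the total-risk formula above; the loss matrix and confusion probabilities below are invented for illustration only, not values from the lecture.

```python
import numpy as np

# Total risk R(C) = sum_{k,l} lambda_kl * P_kl for a 3-class problem.
L = np.array([[0.0, 1.0, 5.0],     # lambda_kl: cost of deciding class k
              [1.0, 0.0, 2.0],     # when the true class is l (zero diagonal)
              [3.0, 2.0, 0.0]])
P = np.array([[0.30, 0.02, 0.01],  # P_kl: joint probability of predicting k
              [0.03, 0.25, 0.04],  # while the true class is l (entries sum to 1)
              [0.02, 0.03, 0.30]])

total_risk = np.sum(L * P)         # element-wise product, summed over k and l
print(f"Total risk R(C) = {total_risk:.3f}")  # zero only if all off-diagonal P_kl vanish
```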

Bayesian risk
Define the conditional risk of performing action ω_l given X:
R(ω_l|X) = Σ_k λ_lk P(ω_k|X),
where λ_d is the value of the risk of not taking any action. Select the class corresponding to the minimum conditional risk; this is equivalent to the following decision procedure Ĉ:
Ĉ(X) = ω_l if R(ω_l|X) = min_k R(ω_k|X) and R(ω_l|X) < λ_d, otherwise take no action.
Discrimination functions g_k(X) provide minimal-risk decision boundaries if good approximations to the P(X|ω_k) probabilities are found.
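A sketch of the minimum conditional-risk decision, including the reject option with cost λ_d; the posteriors and cost matrix are hypothetical numbers chosen only to exercise the rule.

```python
import numpy as np

# Minimum conditional-risk decision with a reject option.
posteriors = np.array([0.55, 0.35, 0.10])      # P(omega_k | X) for three classes
L = np.array([[0.0, 2.0, 4.0],                 # lambda_lk: cost of action omega_l
              [1.0, 0.0, 3.0],                 # when the true class is omega_k
              [4.0, 2.0, 0.0]])
lambda_d = 1.0                                  # cost of taking no action (rejection)

cond_risk = L @ posteriors                      # R(omega_l|X) = sum_k lambda_lk P(omega_k|X)
best = int(np.argmin(cond_risk))
if cond_risk[best] < lambda_d:
    print(f"decide class {best + 1}, conditional risk {cond_risk[best]:.3f}")
else:
    print(f"reject (no action): min risk {cond_risk[best]:.3f} >= {lambda_d}")
```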

Discriminating functions g_k(X) may also be replaced by posterior probabilities: in the region of feature space where the lowest risk corresponds to decision ω_i we have g_i(X) > g_j(X), and g_i(X) = g_j(X) at the border. Any monotonic function of P(ω_i|X) may be used as a discriminating function. Some choices:
majority classifier (priors only),
maximum likelihood classifier,
MAP (Maximum A Posteriori) classifier,
minimum risk classifier (best).
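A toy sketch contrasting the maximum likelihood and MAP rules on a single sample; the likelihoods and priors are made up. The majority classifier would simply pick the class with the largest prior.

```python
import numpy as np

likelihood = np.array([0.30, 0.45])   # P(X | omega_k)
prior      = np.array([0.80, 0.20])   # P(omega_k)

ml_class  = int(np.argmax(likelihood))            # maximum likelihood: ignores priors
map_class = int(np.argmax(likelihood * prior))    # MAP: proportional to the posterior
maj_class = int(np.argmax(prior))                 # majority: priors only
print("ML:", ml_class + 1, "| MAP:", map_class + 1, "| majority:", maj_class + 1)
```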

Summary
Posterior probabilities are what we want:
Posterior = (Likelihood × Prior) / Evidence, i.e. P(ω|X) = P(X|ω) P(ω) / P(X).
Minimization of error may be done in a number of ways, for example by minimizing the expected difference between the model output and the class indicator C_i(X), where C_i(X) = 1 if X is from class ω_i, and 0 otherwise.
Minimization of risk takes into account the cost of different types of errors: using the Bayesian decision procedure minimizes total/conditional risks.
Exercise: try to write the risk for a 2-class, 1D problem with P(ω_1) = 2/3, P(ω_2) = 1/3, P(X|ω_1) = Gauss(0, 1), P(X|ω_2) = Gauss(2, 2).
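One hedged way to check this exercise numerically, assuming Gauss(m, s) denotes a normal distribution with mean m and standard deviation s, and assuming a 0/1 loss so that the total risk equals the Bayes error; both assumptions go beyond what the slide states.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 15, 20001)
p1 = (2/3) * norm.pdf(x, loc=0, scale=1)   # P(omega_1) * P(X|omega_1)
p2 = (1/3) * norm.pdf(x, loc=2, scale=2)   # P(omega_2) * P(X|omega_2)

bayes_risk = np.trapz(np.minimum(p1, p2), x)   # integrate the smaller joint density
print(f"Bayes risk (error) ~ {bayes_risk:.4f}")
```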

1D Gaussians
Simplest application of Bayesian decisions: Gaussian distributions. Two 1D distributions with the same variance but different means:
P(X|ω_k) = G(X; μ_k, σ), k = 1, 2.
Discriminating function: the log of the posteriors,
g(X) = ln P(ω_1|X) − ln P(ω_2|X) = ln P(X|ω_1) − ln P(X|ω_2) + ln [P(ω_1)/P(ω_2)].

1D Gaussians: solution
g(X) > 0 for class ω_1, g(X) < 0 for class ω_2. After simplifications g(X) is linear in X, so there is a single decision threshold X_0 with g(X_0) = 0. For equal a priori probabilities X_0 lies exactly in the middle between the means of the two distributions; otherwise it is shifted towards the class with the smaller prior:
X_0 = (μ_1 + μ_2)/2 − σ²/(μ_1 − μ_2) · ln[P(ω_1)/P(ω_2)].
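A small sketch of this threshold shift, using assumed means, priors, and a single shared variance (equal variance is what the formula requires).

```python
import numpy as np

mu1, mu2, sigma = 0.0, 2.0, 1.0
P1, P2 = 2/3, 1/3

x0_equal   = (mu1 + mu2) / 2                                           # equal priors
x0_shifted = (mu1 + mu2) / 2 - sigma**2 / (mu1 - mu2) * np.log(P1 / P2)
print(f"equal priors:   x0 = {x0_equal:.3f}")
print(f"P1=2/3, P2=1/3: x0 = {x0_shifted:.3f}  (shifted toward the smaller-prior class)")
```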

dD Gaussians
Multivariate Gaussians: P(X|ω_k) = G(X; μ_k, Σ_k). A non-diagonal covariance matrix rotates the distributions; contours of constant density are surfaces of constant Mahalanobis distance, D_M(X, μ) = const. Discriminant functions are quadratic:
g_k(X) = −½ (X − μ_k)ᵀ Σ_k⁻¹ (X − μ_k) − ½ ln|Σ_k| + ln P(ω_k).
The discriminant function is thus the Mahalanobis distance from the mean plus terms that take into account the prior probability and the variance.
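A sketch of the quadratic discriminant for one class, with an assumed mean, covariance, and prior; in practice g_k(X) is computed for every class and the largest value wins.

```python
import numpy as np

mu    = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # non-diagonal: rotated distribution
prior = 0.5
Sigma_inv = np.linalg.inv(Sigma)

def g(X):
    """Quadratic discriminant: -1/2 D_M^2 - 1/2 ln|Sigma| + ln P(omega)."""
    d = X - mu
    mahalanobis_sq = d @ Sigma_inv @ d
    return -0.5 * mahalanobis_sq - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)

print(g(np.array([0.0, 0.0])))
```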

Linear classifiers
Identical covariance matrices => a linear discrimination function:
g_k(X) = W_kᵀ X + W_k0, where W_k = Σ⁻¹ μ_k and W_k0 = −½ μ_kᵀ Σ⁻¹ μ_k + ln P(ω_k).
Linear discrimination functions of this type are used quite frequently, not only in statistical theories but also in neural networks, where X are the input signals, W are the synaptic weights, and W_0 is the neuron activation threshold.
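A sketch of how the weight vector and threshold follow from a shared covariance matrix; all numbers are assumptions for illustration.

```python
import numpy as np

Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
means  = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]

def linear_discriminant(mu, prior):
    W  = Sigma_inv @ mu                                   # weight vector
    W0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)       # threshold term
    return lambda X: W @ X + W0                           # g(X) = W^T X + W0

g = [linear_discriminant(m, p) for m, p in zip(means, priors)]
X = np.array([1.0, 0.5])
print("decide class", 1 + int(np.argmax([gk(X) for gk in g])))
```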

Special cases
If identical covariance matrices and identical a priori probabilities are assumed, the Mahalanobis distance is obtained:
D_M(X, μ_k)² = (X − μ_k)ᵀ Σ⁻¹ (X − μ_k). Note: max_k g_k(X) ⇔ min_k D_M(X, μ_k).
For diagonal covariance matrices the distance becomes a weighted Euclidean distance from the center of the Gaussian distribution,
D(X, μ_k)² = Σ_i (X_i − μ_ki)² / σ_i²,
which serves as the discriminating function. The Bayesian classifier now becomes a nearest prototype classifier: assign X to the class of the nearest prototype μ_k.
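A sketch of the resulting nearest-prototype rule with a weighted Euclidean distance; the prototypes and variances are assumptions.

```python
import numpy as np

prototypes = np.array([[0.0, 0.0],
                       [3.0, 1.0]])          # class means mu_k
variances  = np.array([[1.0, 4.0],
                       [1.0, 4.0]])          # diagonal of Sigma for each class

def classify(X):
    # D^2(X, mu_k) = sum_i (X_i - mu_ki)^2 / sigma_i^2; the smallest distance wins,
    # which is equivalent to the largest discriminant g_k(X).
    d2 = np.sum((X - prototypes) ** 2 / variances, axis=1)
    return int(np.argmin(d2))

print("class", 1 + classify(np.array([1.0, 2.0])))
```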

Illustration
Influence of priors on decision borders, shown in 1D and in 2D (Fig. 2.11 in Duda, Hart & Stork). Note that the decision borders are not midway between the means.

Naive Bayesian classifier
How can we estimate the probability densities required by the Bayesian approach? P(X|ω) = P(X_1, X_2, ..., X_d|ω) in d dimensions, with b values per feature, requires b^d bins to estimate the density of the vectors numerically; in practice there is never enough data for that!
Assume (naively) that all features are conditionally independent: P(X|ω) = Π_i P(X_i|ω). Instead of one multidimensional function, the problem is reduced to the estimation of d one-dimensional functions P_i.
If the class ω is a disease, then the symptoms X_1, X_2, ... depend on the class but not directly on each other, so this usually works; it works for Gaussian class density distributions, and perhaps feature selection has already removed correlated features.
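A sketch of the naive factorization in practice: fit d one-dimensional Gaussians per class instead of one d-dimensional density. The data below is randomly generated just to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
X_class = rng.normal(loc=[1.0, 3.0, -2.0], scale=[0.5, 1.0, 2.0], size=(100, 3))

means = X_class.mean(axis=0)          # one mean per feature
stds  = X_class.std(axis=0)           # one std  per feature

def log_density(x):
    # log P(X|omega) = sum_i log P(X_i|omega) under the independence assumption
    return np.sum(-0.5 * ((x - means) / stds) ** 2
                  - np.log(stds * np.sqrt(2 * np.pi)))

print(log_density(np.array([1.0, 3.0, -2.0])))
```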

NB
Estimate (learn from the data) the posterior probability and use the log-likelihood ratio:
ln [P(ω_1|X)/P(ω_2|X)] = ln [P(ω_1)/P(ω_2)] + Σ_i ln [P(X_i|ω_1)/P(X_i|ω_2)].
If the overall result is positive, then class ω_1 is more likely; each feature makes an additive, positive or negative, contribution. For discrete or symbolic features P(X_i|ω) is estimated from histograms; for continuous features a single Gaussian, or a combination of Gaussians, is frequently fit to estimate the density.
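A sketch of the additive log-likelihood-ratio decision with Gaussian per-feature densities; all parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

prior1, prior2 = 0.5, 0.5
mu1, s1 = np.array([0.0, 1.0]), np.array([1.0, 0.5])   # class 1 per-feature models
mu2, s2 = np.array([2.0, 0.0]), np.array([1.0, 0.5])   # class 2 per-feature models

def log_ratio(x):
    contrib = norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2)  # per-feature terms
    return np.log(prior1 / prior2) + contrib.sum()

x = np.array([0.5, 0.8])
print("class 1" if log_ratio(x) > 0 else "class 2", "| log ratio =", log_ratio(x))
```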

NB Iris example
nbc(iris_type) = {
  prob(iris_type) = {
    Iris-setosa: 50, Iris-versicolor: 50, Iris-virginica: 50 };
  prob(sepal_length|iris_type) = {
    Iris-setosa:     N(5.006, 0.124249) [50],   (normal distribution: mean, variance)
    Iris-versicolor: N(5.936, 0.266433) [50],
    Iris-virginica:  N(6.588, 0.404343) [50] };
  prob(sepal_width|iris_type) = {
    Iris-setosa:     N(3.428, 0.14369) [50],
    Iris-versicolor: N(2.77, 0.0984694) [50],
    Iris-virginica:  N(2.974, 0.104004) [50] };
  prob(petal_length|iris_type) = {
    Iris-setosa:     N(1.462, 0.0301592) [50],
    Iris-versicolor: N(4.26, 0.220816) [50],
    Iris-virginica:  N(5.552, 0.304588) [50] };
  prob(petal_width|iris_type) = {
    Iris-setosa:     N(0.246, 0.0111061) [50],
    Iris-versicolor: N(1.326, 0.0391061) [50],
    Iris-virginica:  N(2.026, 0.0754327) [50] };
}
Use RapidMiner, Weka, or the applet: http://www.cs.technion.ac.il/~rani/LocBoost/
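A sketch that applies the model printout above to a single, hypothetical flower, treating the second parameter of N(·,·) as the variance (which matches the Iris statistics).

```python
import numpy as np
from scipy.stats import norm

params = {   # per class: (means, variances) for sepal_length, sepal_width, petal_length, petal_width
    "Iris-setosa":     ([5.006, 3.428, 1.462, 0.246], [0.124249, 0.14369, 0.0301592, 0.0111061]),
    "Iris-versicolor": ([5.936, 2.770, 4.260, 1.326], [0.266433, 0.0984694, 0.220816, 0.0391061]),
    "Iris-virginica":  ([6.588, 2.974, 5.552, 2.026], [0.404343, 0.104004, 0.304588, 0.0754327]),
}
prior = 50 / 150                       # equal class counts

x = np.array([6.1, 2.8, 4.7, 1.2])     # a hypothetical flower to classify
scores = {}
for cls, (mu, var) in params.items():
    # log posterior up to a constant: log prior + sum of per-feature log densities
    scores[cls] = np.log(prior) + norm.logpdf(x, loc=mu, scale=np.sqrt(var)).sum()

print(max(scores, key=scores.get))     # for this x the scores favour Iris-versicolor
```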

NB on fuzzy XOR example
For demonstrations one may use WEKA, GhostMiner 3, or the applet: http://www.cs.technion.ac.il/~rani/LocBoost/
NB fails completely on such data. Why? Because on XOR-like data each single-feature marginal P(X_i|ω_k) is the same for both classes; the joint probabilities P(X_1, X_2|ω_k) are needed, and they are more difficult to estimate. One remedy: use local probability estimates around the query point.
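A sketch of why NB breaks down here, using an assumed layout of four Gaussian clusters in an XOR arrangement: both classes end up with essentially identical single-feature statistics, so the factorized densities cannot separate them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# class 1: clusters around (0,0) and (1,1); class 2: around (0,1) and (1,0)
c1 = np.vstack([rng.normal([0, 0], 0.1, (n, 2)), rng.normal([1, 1], 0.1, (n, 2))])
c2 = np.vstack([rng.normal([0, 1], 0.1, (n, 2)), rng.normal([1, 0], 0.1, (n, 2))])

for i, name in enumerate(["X1", "X2"]):
    print(name, "class-1 mean/std:", c1[:, i].mean().round(2), c1[:, i].std().round(2),
          "| class-2 mean/std:", c2[:, i].mean().round(2), c2[:, i].std().round(2))
# Both classes give roughly mean 0.5 and std 0.51 on each feature, so the naive
# product of 1D densities assigns nearly equal probability everywhere.
```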

NB generalization
If the first-order (independent-feature) model is not sufficient for strongly interacting features, one can introduce second- and third-order dependencies (to find interacting features, look at the covariance matrix). This leads to graphical causal Bayesian network models.
Surprisingly, adding explicit correlations frequently makes things worse, presumably because of the difficulty of reliably estimating two- or three-dimensional densities.
NB is quite accurate on many datasets and fairly popular; it has many free software packages and numerous applications. Local approach: make all estimations only around X!

Literature
Duda, Hart & Stork, Chapter 2, gives a good introduction to Bayesian decision theory, including some more advanced sections: on min-max criteria, on overall risk minimization that is independent of prior probabilities, and on the Neyman-Pearson criterion that introduces constraints on the acceptable risk for selected classes.
A more detailed exposition of Bayesian decision and risk theory may be found in Chapter 2 of S. Theodoridis & K. Koutroumbas, Pattern Recognition (2nd ed).
Decision theory is obviously connected with hypothesis testing; Chapter 3 of K. Fukunaga, Introduction to Statistical Pattern Recognition (2nd ed, 1990), is probably the best source, with numerous examples.