Computational Intelligence: Methods and Applications Lecture 12 Bayesian decisions: foundation of learning Włodzisław Duch Dept. of Informatics, UMK Google: W Duch

Learning
Learning from data requires a model of the data. Traditionally, parametric models of different phenomena were developed in science and engineering; parametric models are easy to interpret, so if the phenomenon is simple enough and a theory exists, one can construct a model and use an algorithmic approach. Empirical non-parametric modeling is data-driven and goal-oriented; it dominates in biological systems. Learn from data! Given some examples (= training data), create a model of the data that answers a specific question, estimating those characteristics of the data that may be useful for making future predictions. Learning = estimating the parameters of the (non-parametric) model; paradoxically, non-parametric models have a lot of parameters. Many other approaches to learning exist, but there is no time to explore them...

Probability
To talk about prediction errors, probability concepts are needed. Samples X ∈ 𝒳 are divided into K categories, called classes ω_1 ... ω_K. More generally, ω_i is a state of nature that we would like to predict. P_k = P(ω_k) is the a priori (unconditional) probability of observing X ∈ ω_k. If nothing else is known, one should predict that a new sample X belongs to the majority class. Majority classifier: assigns all new X to the most frequent class. Example: a weather prediction system – the weather tomorrow will be the same as yesterday (high-accuracy prediction!).
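A minimal sketch of this baseline (not part of the lecture; the weather labels are a made-up example):

```python
# Majority-class baseline: always predict the most frequent class
# seen in the training labels.
from collections import Counter

def majority_class(train_labels):
    """Return the most frequent class label in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

history = ["dry", "dry", "rain", "dry", "dry"]   # hypothetical weather record
print(majority_class(history))                   # -> "dry"
# The accuracy of this baseline equals the largest prior P(w_k),
# so any learned classifier should do at least as well.
```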

Conditional probability
Predictions should never be worse than those of the majority classifier! Usually the class-conditional probability P(X|ω_k) is also known or may easily be measured; the condition here is that X ∈ ω_k. The joint probability of observing X from class ω_k is P(X, ω_k) = P(X|ω_k) P(ω_k). Is knowledge of the conditional probability sufficient to make predictions? No! We are interested in the posterior probability P(ω_k|X).
Fig. 2.1, from Duda, Hart, Stork, Pattern Classification (Wiley).

Bayes rule
Posterior conditional probabilities are normalized: Σ_k P(ω_k|X) = 1. Bayes rule for 2 classes is derived from this and from the joint probability:
P(ω_k|X) = P(X|ω_k) P(ω_k) / P(X),   P(X) = P(X|ω_1) P(ω_1) + P(X|ω_2) P(ω_2).
P(X) is the unconditional probability of selecting sample X; usually it is just 1/n, where n = the number of all samples. For P_1 = 2/3 and P_2 = 1/3 the previous figure becomes:
Fig. 2.2, from Duda, Hart, Stork, Pattern Classification (Wiley).
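A hedged sketch of this rule in code: posteriors computed from priors P(ω_k) and class-conditional values p(x|ω_k) at one sample x (the numbers are illustrative, not those of Fig. 2.2):

```python
def posteriors(priors, likelihoods):
    """priors[k] = P(w_k), likelihoods[k] = p(x|w_k); returns P(w_k|x)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    p_x = sum(joint)                    # unconditional P(x), the normalizer
    return [j / p_x for j in joint]

post = posteriors([2/3, 1/3], [0.2, 0.5])   # illustrative values
print(post, sum(post))                      # posteriors sum to 1
```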

Bayes decisions
Bayes decision: given a sample X, select class ω_1 if P(ω_1|X) > P(ω_2|X). The probability of an error is then P(error|X) = min[P(ω_1|X), P(ω_2|X)]. The average error is P(error) = ∫ P(error|X) P(X) dX. The Bayes decision rule minimizes the average error by selecting the smaller P(ω|X) as the error at each X. Using Bayes rule and multiplying both sides by P(X), the rule becomes: select ω_1 if P(X|ω_1) P(ω_1) > P(X|ω_2) P(ω_2).
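A small sketch of the two-class decision at a single point X; the posterior values are illustrative only:

```python
def bayes_decision(post_w1, post_w2):
    """Select the class with the larger posterior; P(error|X) is the smaller one."""
    chosen = 1 if post_w1 > post_w2 else 2
    p_error = min(post_w1, post_w2)
    return chosen, p_error

print(bayes_decision(0.7, 0.3))    # -> (1, 0.3)
print(bayes_decision(0.2, 0.8))    # -> (2, 0.2)
```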

Likelihood
Bayesian approach to learning: use data to model probabilities. The Bayes decision depends on the likelihood ratio: select ω_1 if P(X|ω_1)/P(X|ω_2) > P(ω_2)/P(ω_1). For equal a priori probabilities the class-conditional probabilities decide. On a finite data sample given for training the error is the average of the smaller posterior over the training samples, P(error) ≈ (1/n) Σ_i min[P(ω_1|X_i), P(ω_2|X_i)]. The assumption here is that P(X) is reflected in the frequency of samples for different X.
Fig. 2.3, from Duda, Hart, Stork, Pattern Classification (Wiley).

2D decision regions
For Gaussian distributions of the class-conditional probabilities P(X|ω_k), the decision boundaries in 2D are hyperbolic in the case shown, and the decision region R_2 is disconnected. The ellipses show contours of constant high values of P_k(X).
Fig. 2.6, from Duda, Hart, Stork, Pattern Classification (Wiley).
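A small sketch of such a classifier (the means, covariances and priors below are assumptions, not those of Fig. 2.6): with Gaussian class-conditional densities the Bayes rule compares P(ω_k) p(x|ω_k), and unequal covariance matrices bend the decision boundary into a curved shape:

```python
import numpy as np
from scipy.stats import multivariate_normal

g1 = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]])
g2 = multivariate_normal(mean=[2.0, 0.0], cov=[[4.0, 0.0], [0.0, 0.25]])
P1, P2 = 0.5, 0.5                         # assumed equal priors

def bayes_class(x):
    """Return 1 or 2 depending on which weighted density is larger at x."""
    return 1 if P1 * g1.pdf(x) > P2 * g2.pdf(x) else 2

for x in ([0.0, 0.0], [2.0, 0.0], [6.0, 0.0]):
    print(x, "->", bayes_class(x))
```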

Example
Let ω_1 be the state of nature called “dengue”, and ω_2 the opposite, no dengue. Let the prior probability for people in Singapore be P(ω_1) = 0.1% = 0.001. Let the test T be 99% accurate, so that the probability of a positive outcome for people with dengue is P(T=+|ω_1) = 0.99, and of a negative outcome for healthy people is also P(T=−|ω_2) = 0.99. What is the chance that you have dengue if your test is positive, i.e. what is the probability P(ω_1|T=+)?
P(T=+) = P(ω_1, T=+) + P(ω_2, T=+) = 0.99·0.001 + 0.01·0.999 ≈ 0.011
P(ω_1|T=+) = P(T=+|ω_1) P(ω_1) / P(T=+) = 0.99·0.001/0.011 ≈ 0.09, or 9%.
Use this calculator to check:
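A worked check of the arithmetic above (all values taken from the slide):

```python
P_w1 = 0.001          # prior: dengue
P_w2 = 0.999          # prior: no dengue
P_pos_w1 = 0.99       # sensitivity, P(T=+|w1)
P_pos_w2 = 0.01       # false-positive rate, 1 - P(T=-|w2)

P_pos = P_pos_w1 * P_w1 + P_pos_w2 * P_w2      # ~0.011
P_w1_pos = P_pos_w1 * P_w1 / P_pos             # ~0.09, i.e. about 9%
print(round(P_pos, 3), round(P_w1_pos, 2))
```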

Statistical theory of decisions
Decisions carry risks, costs and losses. Consider a general decision procedure:
{ω_1 .. ω_K} – states of nature,
{α_1 .. α_a} – actions that may be taken,
λ(ω_i, α_j) – the cost, loss or risk associated with taking action α_j in state ω_i.
Example: classification decisions Ĉ: X → {ω_1 .. ω_K, ω_D, ω_O}. Action α_i assigns to vector X a class label 1..K, or
ω_D – no confidence in classification, reject/leave the sample unclassified,
ω_O – outlier, an untypical case, perhaps a new class (used rarely).

Errors and losses
The unconditional probability of a wrong (non-optimal) action (decision) – predicting ω_l, l ≠ k, when the true state is ω_k – is P(ω_l, ω_k), l ≠ k. No action (no decision), or rejection of sample X when the true class is ω_k, has probability P(ω_D, ω_k).
Assume the simplest 0/1 loss function: no cost if the optimal decision is taken, identical costs for all errors, and some cost for no action (rejection):
λ(α_l, ω_k) = 0 for l = k, 1 for l ≠ k (l ≤ K), and λ(α_D, ω_k) = λ_d.

... and risks
The risk of the decision-making procedure Ĉ when the true state is ω_k, with ω_{K+1} = ω_D, is
R(Ĉ, ω_k) = Σ_{l=1..K+1} λ(α_l, ω_k) P_kl,
where P_kl are the elements of the confusion matrix P. Note that the rows of P correspond to the true classes ω_k, and the columns to the predicted classes ω_l, l = 1..K+1, i.e. the classifier’s decisions.
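A hedged sketch of this computation with a hypothetical 2-class confusion matrix (rows: true classes, last column: rejections, entries taken here as fractions of all samples) and the 0/1 loss with an assumed rejection cost:

```python
import numpy as np

P = np.array([[0.40, 0.05, 0.05],      # true class w_1
              [0.10, 0.35, 0.05]])     # true class w_2
lam_d = 0.3                            # assumed rejection cost
lam = np.array([[0.0, 1.0, lam_d],     # lambda(alpha_l, w_1)
                [1.0, 0.0, lam_d]])    # lambda(alpha_l, w_2)

R_per_class = (lam * P).sum(axis=1)    # R(C, w_k), one value per true class
print(R_per_class)                     # -> [0.065, 0.115]
```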

... and more risks
The trace of the confusion matrix, Tr(P) = Σ_{k=1..K} P_kk, is equal to the accuracy of the classifier, ignoring the costs of mistakes. The total risk of the decision procedure Ĉ is
R(Ĉ) = Σ_{k=1..K} R(Ĉ, ω_k).
The conditional risk of assigning sample X to class ω_k is
R(ω_k|X) = Σ_{l=1..K} λ(α_k, ω_l) P(ω_l|X).
For the special case of costs of mistakes = 1 and rejection cost = λ_d this gives R(ω_k|X) = 1 − P(ω_k|X) and R(ω_D|X) = λ_d.
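A minimal sketch of the resulting decision rule with a reject option for the 0/1 loss; the rejection cost and the posterior values below are assumptions, not from the lecture:

```python
import numpy as np

def decide_with_reject(posteriors, lam_d=0.2):
    """Assigning X to class k costs 1 - P(w_k|X); rejecting costs lam_d.
    Choose the action with the smaller conditional risk."""
    k = int(np.argmax(posteriors))
    risk = 1.0 - posteriors[k]
    return ("reject", lam_d) if risk > lam_d else (k, risk)

print(decide_with_reject(np.array([0.95, 0.05])))   # confident -> class 0
print(decide_with_reject(np.array([0.55, 0.45])))   # uncertain -> reject
```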