2. Bayes Decision Theory Prof. A.L. Yuille Stat 231. Fall 2004.

Decisions with Uncertainty. Bayes Decision Theory is a theory for making decisions in the presence of uncertainty. Input data x. Output label y: Salmon y = +1, Sea Bass y = -1. Learn a decision rule f(x) taking values in {+1, -1}.

Decision Rule for Fish. Classify fish as Salmon or Sea Bass by decision rule f(x).

Basic Ingredients. Assume there are probability distributions generating the data: P(x|y=+1) and P(x|y=-1). A loss function L(f(x),y) specifies the loss of making decision f(x) when the true state is y. A prior distribution P(y) gives the prior probability of each class. Joint distribution: P(x,y) = P(x|y) P(y).
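
As a toy illustration (not from the original slides), here is a minimal Python sketch of these ingredients for a discrete x; the particular numbers are made up:

    import numpy as np

    # Hypothetical class-conditional distributions P(x|y) over 3 discrete values of x.
    P_x_given_pos = np.array([0.1, 0.3, 0.6])   # P(x | y = +1)
    P_x_given_neg = np.array([0.5, 0.3, 0.2])   # P(x | y = -1)
    P_y = {+1: 0.4, -1: 0.6}                    # prior P(y)

    # Joint distribution P(x, y) = P(x|y) P(y).
    P_joint = {+1: P_x_given_pos * P_y[+1],
               -1: P_x_given_neg * P_y[-1]}

    # 0-1 loss: L(f(x), y) = 0 if the decision equals the true state, 1 otherwise.
    def loss(decision, y):
        return 0.0 if decision == y else 1.0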

Minimize the Risk. The risk of a decision rule f(x) is the expected loss, R(f) = Σ_{x,y} L(f(x), y) P(x, y). The Bayes Decision Rule f*(x) is the rule that minimizes the risk: f* = arg min_f R(f). The Bayes Risk is the risk of this rule, R(f*).

Minimize the Risk. Write P(x,y) = P(y|x) P(x). Then we can write the Risk as: R(f) = Σ_x P(x) Σ_y L(f(x), y) P(y|x). The best decision for input x is therefore f*(x) = arg min_a Σ_y L(a, y) P(y|x).
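
A minimal Python sketch of this rule, continuing the hypothetical discrete example above: for each x, choose the decision a that minimizes the expected loss Σ_y L(a, y) P(y|x).

    def posterior(x_index):
        """P(y|x) by Bayes rule, using the toy P(x|y) and P(y) above."""
        evidence = P_joint[+1][x_index] + P_joint[-1][x_index]   # P(x)
        return {y: P_joint[y][x_index] / evidence for y in (+1, -1)}

    def bayes_decision(x_index):
        """f*(x): the decision minimizing the expected loss given x."""
        post = posterior(x_index)
        return min((+1, -1), key=lambda a: sum(loss(a, y) * post[y] for y in (+1, -1)))

    print([bayes_decision(i) for i in range(3)])   # [-1, -1, +1] for these toy numbers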

Bayes Rule. The posterior distribution P(y|x) is obtained from the likelihood function P(x|y) and the prior P(y) by Bayes rule: P(y|x) = P(x|y) P(y) / P(x), where P(x) = Σ_y P(x|y) P(y). Bayes rule has been controversial (historically) because of the prior P(y) (subjective?). But in Bayes Decision Theory, everything starts from the joint distribution P(x,y).
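
For a concrete (made-up) instance: if P(x|y=+1) = 0.6, P(x|y=-1) = 0.3 and P(y=+1) = P(y=-1) = 0.5, then P(y=+1|x) = (0.6)(0.5) / [(0.6)(0.5) + (0.3)(0.5)] = 0.30 / 0.45 = 2/3.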

Risk. The Risk is based on averaging over all possible x & y: the average loss. Alternatively, one can try to minimize the worst-case loss over x & y: the Minimax criterion. This course uses the Risk, i.e. the average loss.

Generative & Discriminative. Generative methods aim to determine probability models P(x|y) & P(y). Discriminative methods aim directly at estimating the decision rule f(x). Vapnik argues for discriminative methods: don’t solve a harder problem than you need to – only the probabilities near the decision boundary matter.

Discriminant Functions. For the two-category case the Bayes decision rule depends only on a discriminant function, e.g. the log-likelihood ratio g(x) = log [P(x|y=+1) / P(x|y=-1)]. The Bayes decision rule is of the form: decide y = +1 if g(x) > T, and y = -1 otherwise, where T is a threshold determined by the loss function (and the priors).
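
A small Python sketch of this thresholded rule (an illustrative reconstruction, not the original slide’s code): the threshold folds in both the loss entries and the prior ratio.

    import numpy as np

    def log_likelihood_ratio(px_pos, px_neg):
        """Discriminant g(x) = log P(x|y=+1) - log P(x|y=-1)."""
        return np.log(px_pos) - np.log(px_neg)

    def threshold(L, prior_pos, prior_neg):
        """T such that deciding +1 iff g(x) > T minimizes the expected loss.
        L[(a, y)] is the loss of deciding a when the true state is y."""
        num = L[(+1, -1)] - L[(-1, -1)]   # extra loss of a false positive
        den = L[(-1, +1)] - L[(+1, +1)]   # extra loss of a false negative
        return np.log(num / den) + np.log(prior_neg / prior_pos)

    # Example: 0-1 loss and equal priors give T = 0, i.e. pick the larger likelihood.
    L01 = {(+1, +1): 0, (-1, -1): 0, (+1, -1): 1, (-1, +1): 1}
    decide_pos = log_likelihood_ratio(0.6, 0.3) > threshold(L01, 0.5, 0.5)
    print(decide_pos)   # True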

Two-State Case. Detect “target” vs. “non-target”. Let the loss function pay a penalty of 1 for a misclassification and 0 otherwise. The Risk then becomes the Error, R(f) = P(f(x) ≠ y) = P(f(x)=+1, y=-1) + P(f(x)=-1, y=+1), and the Bayes Risk becomes the Bayes Error. The Error is the sum of the false positives F+ (non-targets classified as targets) and the false negatives F- (targets classified as non-targets).

Gaussian Example: 1. Is a bright light flashing? Let n be the number of photons emitted by a dim or a bright light.

Gaussian Example: 2. P(n|dim) and P(n|bright) are Gaussians with means μ_dim, μ_bright and standard deviations σ_dim, σ_bright. The Bayes decision rule selects “dim” if P(n|dim) P(dim) > P(n|bright) P(bright). Errors: the false positives and false negatives are the probabilities of the Gaussian tails that fall on the wrong side of the resulting decision boundary.
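
A numerical sketch of this example (illustrative numbers only, assuming equal priors and a common standard deviation so that the decision boundary is the midpoint of the means; requires SciPy):

    from scipy.stats import norm

    mu_dim, mu_bright, sigma = 50.0, 70.0, 10.0   # made-up photon-count parameters
    t = 0.5 * (mu_dim + mu_bright)                # boundary for equal priors and variances

    false_pos = norm.sf(t, loc=mu_dim, scale=sigma)      # P(n > t | dim): dim called bright
    false_neg = norm.cdf(t, loc=mu_bright, scale=sigma)  # P(n < t | bright): bright called dim
    bayes_error = 0.5 * false_pos + 0.5 * false_neg      # average over the equal priors
    print(t, false_pos, false_neg, bayes_error)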

Example: Multidimensional Gaussian Distributions. Suppose the two classes have Gaussian distributions for P(x|y) with different means μ+ and μ- but the same covariance Σ. The discriminant function is then a plane: g(x) = w·x + b, with w = Σ^{-1}(μ+ − μ-). Alternatively, seek a planar decision rule directly without attempting to model the distributions – we only care about the data near the decision boundary.
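
A minimal NumPy sketch of the generative route (an illustration, not the slides’ code): plug assumed means, a shared covariance, and priors into the standard formulas for the plane’s parameters.

    import numpy as np

    mu_pos = np.array([2.0, 1.0])                 # assumed class means
    mu_neg = np.array([0.0, 0.0])
    Sigma  = np.array([[1.0, 0.3], [0.3, 1.0]])   # shared covariance
    prior_pos, prior_neg = 0.5, 0.5

    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu_pos - mu_neg)             # normal vector of the decision plane
    b = (-0.5 * mu_pos @ Sigma_inv @ mu_pos
         + 0.5 * mu_neg @ Sigma_inv @ mu_neg
         + np.log(prior_pos / prior_neg))

    def classify(x):
        """Decide +1 if the linear discriminant g(x) = w.x + b is positive."""
        return 1 if w @ x + b > 0 else -1

    print(classify(np.array([1.5, 0.5])), classify(np.array([-1.0, 0.0])))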

Generative vs. Discriminative. The generative approach will attempt to estimate the Gaussian distributions from data – and then derive the decision rule. The discriminative approach will seek to estimate the decision rule directly by learning the discriminant plane. In practice, we will not know the form of the distributions or the form of the discriminant.

Gaussian. The Gaussian case with unequal covariances: the decision boundary is then quadratic rather than planar.

Discriminative Models & Features. In practice, discriminative methods are usually defined on features extracted from the data (e.g. the length and brightness of a fish). Calculate features z = h(x). Bayes Decision Theory says that this, in general, throws away information. It restricts us to a sub-class of possible decision rules – those that can be expressed in terms of the features z = h(x).

Bayes Decision Rule and Learning. Bayes Decision Theory assumes that we know, or can learn, the distributions P(x|y). This is often not practical, or extremely difficult. In real problems, you have a set of classified (labelled) data {(x_i, y_i) : i = 1, ..., N}. You can attempt to learn P(x|y=+1) & P(x|y=-1) from these samples (next few lectures), using parametric & non-parametric approaches. Question: when do you have enough data to learn these probabilities accurately? This depends on the complexity of the model.
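
As a parametric illustration (assuming Gaussian class-conditional models; the labelled data below is synthetic), the estimates of P(x|y) are essentially the per-class sample mean and covariance:

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic labelled data: 200 samples per class from made-up Gaussians.
    X_pos = rng.multivariate_normal([2.0, 1.0], np.eye(2), size=200)
    X_neg = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=200)

    # Estimates of P(x|y=+1) and P(x|y=-1) under a Gaussian model.
    # (np.cov uses the unbiased 1/(N-1) normalisation rather than the strict MLE 1/N.)
    mu_pos_hat, Sigma_pos_hat = X_pos.mean(axis=0), np.cov(X_pos, rowvar=False)
    mu_neg_hat, Sigma_neg_hat = X_neg.mean(axis=0), np.cov(X_neg, rowvar=False)
    prior_pos_hat = len(X_pos) / (len(X_pos) + len(X_neg))

    print(mu_pos_hat, mu_neg_hat, prior_pos_hat)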

Machine Learning. Replace the Risk by the Empirical Risk on the training data: R_emp(f) = (1/N) Σ_{i=1}^{N} L(f(x_i), y_i). How does minimizing the empirical risk relate to minimizing the true risk? Key issue: when can we generalize, i.e. be confident that the decision rule we have learnt on the training data will yield good results on unseen data?
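
A one-function sketch of the empirical risk (illustrative; with the default 0-1 loss this is just the training error rate):

    import numpy as np

    def empirical_risk(f, X, y, loss=lambda a, b: float(a != b)):
        """(1/N) * sum_i L(f(x_i), y_i) over a labelled training set."""
        return np.mean([loss(f(x_i), y_i) for x_i, y_i in zip(X, y)])

    # e.g. empirical_risk(classify, X, y) with the linear classifier sketched earlier.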

Machine Learning. Vapnik’s theory gives a mathematically elegant way of answering these questions. It assumes that the data are sampled from an unknown distribution. Vapnik’s theory gives bounds for when we can generalize; unfortunately these bounds are very conservative. In practice, train on part of the dataset and test on the other part(s).
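
In that spirit, a tiny holdout-split sketch (illustrative only; X and y are assumed to be NumPy arrays of inputs and labels):

    import numpy as np

    def holdout_split(X, y, train_frac=0.7, seed=0):
        """Shuffle the data and split it into a training part and a held-out test part."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        n_train = int(train_frac * len(X))
        train, test = idx[:n_train], idx[n_train:]
        return X[train], y[train], X[test], y[test]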

Extensions to Multiple Classes. The decision rule partitions the feature space into k subspaces, one per class. Conceptually straightforward – see Duda, Hart & Stork.
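
A minimal sketch of the k-class rule (illustrative): for each x pick the class that minimizes the expected loss, which with 0-1 loss is just the class with the largest posterior.

    import numpy as np

    def bayes_decision_multiclass(posteriors, L):
        """posteriors: array of P(y=k|x) for k = 0..K-1.
        L[a, k]: loss of deciding class a when the true class is k.
        Returns the class with the smallest expected loss."""
        expected_loss = L @ posteriors
        return int(np.argmin(expected_loss))

    # With 0-1 loss this reduces to the argmax of the posterior:
    K = 3
    L01 = np.ones((K, K)) - np.eye(K)
    print(bayes_decision_multiclass(np.array([0.2, 0.5, 0.3]), L01))   # 1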