2. Bayes Decision Theory Prof. A.L. Yuille Stat 231. Fall 2004.
Decisions with Uncertainty. Bayes Decision Theory is a theory of how to make decisions in the presence of uncertainty. Input data x. Salmon y = +1, Sea Bass y = -1. Learn a decision rule f(x) taking values in {+1, -1}.
Decision Rule for Fish. Classify fish as Salmon or Sea Bass by decision rule f(x).
Basic Ingredients. Assume there are probability distributions for generating the data. P(x|y=1) and P(x|y=-1). Loss function L(f(x),y) specifies the loss of making decision f(x) when true state is y. Distribution P(y). Prior probability on y. Joint Distribution P(x,y) = P(x|y) P(y).
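Minimize the Risk. The risk of a decision rule f(x) is the expected loss: $R(f) = \sum_{x,y} L(f(x), y)\, P(x,y)$ (the sum over x becomes an integral when x is continuous). Bayes Decision Rule f*(x): $f^* = \arg\min_f R(f)$. The Bayes Risk: $R^* = R(f^*) = \min_f R(f)$.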
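Minimize the Risk. Write P(x,y) = P(y|x) P(x). Then we can write the Risk as: $R(f) = \sum_x P(x) \sum_y L(f(x), y)\, P(y|x)$. Because P(x) is non-negative, the risk is minimized by minimizing the inner sum separately for each x. The best decision for input x is f*(x): $f^*(x) = \arg\min_{f(x)} \sum_y L(f(x), y)\, P(y|x)$.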
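The pointwise minimization can be sketched in a few lines of Python. This is a minimal illustration, assuming the posterior P(y|x) at one input x is already available as a dictionary; the names `expected_loss`, `bayes_decision`, `zero_one_loss`, and `asymmetric_loss`, and all the numbers, are hypothetical and not from the lecture.

```python
def expected_loss(decision, posterior, loss):
    """Sum over y of L(decision, y) * P(y|x) for one fixed input x."""
    return sum(loss(decision, y) * p for y, p in posterior.items())


def bayes_decision(posterior, loss, decisions=(+1, -1)):
    """Return the decision with the smallest expected loss at this x."""
    return min(decisions, key=lambda d: expected_loss(d, posterior, loss))


def zero_one_loss(decision, y):
    """Penalty 1 for a misclassification, 0 otherwise."""
    return 0.0 if decision == y else 1.0


def asymmetric_loss(decision, y):
    """Hypothetical loss: missing a salmon (y = +1) costs 5, the reverse costs 1."""
    if decision == y:
        return 0.0
    return 5.0 if y == +1 else 1.0


# Assumed posterior P(y|x) at some input x (made-up numbers).
posterior = {+1: 0.3, -1: 0.7}

print(bayes_decision(posterior, zero_one_loss))    # picks the more probable class
print(bayes_decision(posterior, asymmetric_loss))  # the loss can overturn that choice
```

With the 0-1 loss the rule simply picks the class with the larger posterior; with the asymmetric loss the less probable class wins because the cost of missing it is higher.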
Bayes Rule. Posterior distribution P(y|x): $P(y|x) = P(x|y)\, P(y) / P(x)$, where $P(x) = \sum_y P(x|y)\, P(y)$. Likelihood function P(x|y). Prior P(y). Bayes Rule has been controversial (historically) because of the Prior P(y) (subjective?). But in Bayes Decision Theory, everything starts from the joint distribution P(x,y).
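A small numerical sketch of Bayes rule for the two-class case follows. The function name `posterior_from_bayes` and the likelihood and prior values are assumptions made up for illustration.

```python
def posterior_from_bayes(likelihoods, prior):
    """Bayes rule: P(y|x) = P(x|y) P(y) / P(x), with P(x) = sum_y P(x|y) P(y)."""
    joint = {y: likelihoods[y] * prior[y] for y in prior}  # P(x, y)
    evidence = sum(joint.values())                         # P(x)
    return {y: joint[y] / evidence for y in joint}


# Assumed values of P(x|y) at one observed x, and an assumed prior P(y).
likelihoods = {+1: 0.2, -1: 0.6}
prior = {+1: 0.75, -1: 0.25}

post = posterior_from_bayes(likelihoods, prior)
print(post)
```

Here the likelihood favors y = -1 but the prior favors y = +1, and the two effects cancel, so the posterior is evenly split between the classes.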
Risk. The Risk is based on averaging over all possible x & y. Average Loss. Alternatively, can try to minimize the worst risk over x & y. Minimax Criterion. This course uses the Risk, or average loss.
Generative & Discriminative. Generative methods aim to determine probability models P(x|y) & P(y). Discriminative methods aim directly at estimating the decision rule f(x). Vapnik argues for Discriminative Methods: Don’t solve a harder problem than you need to. Only care about the probabilities near the decision boundaries.
Discriminant Functions. For the two-category case the Bayes decision rule depends on the discriminant function: $g(x) = \log \frac{P(y=+1|x)}{P(y=-1|x)}$. The Bayes decision rule is of the form: $f^*(x) = \mathrm{sign}(g(x) - T)$, where T is a threshold, which is determined by the loss function.
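A short derivation shows how the loss function fixes the threshold. It is a sketch under two assumptions: the discriminant is written as the log posterior ratio, and each error costs more than the corresponding correct decision (so both bracketed terms below are positive).

```latex
\begin{align*}
f^*(x) = +1
  &\iff \sum_{y} L(+1,y)\,P(y|x) \;<\; \sum_{y} L(-1,y)\,P(y|x) \\
  &\iff P(y=+1|x)\,\bigl[L(-1,+1) - L(+1,+1)\bigr] \;>\; P(y=-1|x)\,\bigl[L(+1,-1) - L(-1,-1)\bigr] \\
  &\iff \log\frac{P(y=+1|x)}{P(y=-1|x)} \;>\; T,
  \qquad T = \log\frac{L(+1,-1) - L(-1,-1)}{L(-1,+1) - L(+1,+1)} .
\end{align*}
```

For the 0-1 loss this gives T = 0. Written in terms of the likelihood ratio $P(x|y=+1)/P(x|y=-1)$ instead, the threshold shifts to $T + \log\frac{P(y=-1)}{P(y=+1)}$, so the prior enters as well.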
Two-State Case. Detect “target” or “non-target”. Let loss function pay a penalty of 1 for misclassification, 0 otherwise. Risk becomes Error. Bayes Risk becomes Bayes Error. Error is the sum of false positives F+ (non-targets classified as targets) and false negatives F- (targets classified as non-targets).
Gaussian Example: 1. Is a bright light flashing? n is the number of photons emitted by the dim or bright light.
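Gaussian Example: 2. P(n|dim) and P(n|bright) are Gaussians with means $\mu_D$ and $\mu_B$ (with $\mu_D < \mu_B$) and s.d. $\sigma$. Bayes decision rule selects “dim” if $n < T$, where $T = (\mu_D + \mu_B)/2$ for equal priors and 0-1 loss; otherwise it selects “bright”. Errors: false positives $P(n > T \mid \text{dim})$ and false negatives $P(n < T \mid \text{bright})$.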
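The two error terms can be evaluated numerically. The sketch below assumes equal priors, a shared standard deviation, 0-1 loss, and a photon count treated as a continuous Gaussian variable; the function names and the parameter values are made up for illustration.

```python
import math


def gaussian_cdf(t, mean, sd):
    """P(n <= t) for a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((t - mean) / (sd * math.sqrt(2.0))))


def error_rates(mu_dim, mu_bright, sd, threshold):
    """False positive and false negative rates of the rule 'bright if n > threshold'."""
    false_pos = 1.0 - gaussian_cdf(threshold, mu_dim, sd)  # dim classified as bright
    false_neg = gaussian_cdf(threshold, mu_bright, sd)     # bright classified as dim
    return false_pos, false_neg


# Hypothetical parameters, not from the lecture.
mu_dim, mu_bright, sd = 10.0, 14.0, 2.0
threshold = 0.5 * (mu_dim + mu_bright)  # midpoint: valid for equal priors and equal s.d.

fp, fn = error_rates(mu_dim, mu_bright, sd, threshold)
bayes_error = 0.5 * fp + 0.5 * fn       # each rate weighted by its prior of 1/2
print(fp, fn, bayes_error)
```

Because the threshold sits exactly one standard deviation from each mean, the two error rates coincide; moving the threshold trades one kind of error for the other.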
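Example: Multidimensional Gaussian Distributions. Suppose the two classes have Gaussian distributions for P(x|y), with different means $\mu_{+1}$, $\mu_{-1}$ but the same covariance $\Sigma$. The discriminant function is then linear in x, so the decision boundary is a plane: $a \cdot x + b = 0$, with $a = \Sigma^{-1}(\mu_{+1} - \mu_{-1})$ and b a constant. Alternatively, seek a planar decision rule without attempting to model the distributions. Only care about the data near the decision boundary.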
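The reason a shared covariance gives a planar boundary can be seen by expanding the log-likelihood ratio. The notation $\mu_{+1}$, $\mu_{-1}$, $\Sigma$ for the class means and the common covariance is an assumption of this sketch.

```latex
\begin{align*}
\log\frac{P(x|y=+1)}{P(x|y=-1)}
  &= -\tfrac{1}{2}(x-\mu_{+1})^{T}\Sigma^{-1}(x-\mu_{+1})
     +\tfrac{1}{2}(x-\mu_{-1})^{T}\Sigma^{-1}(x-\mu_{-1}) \\
  &= (\mu_{+1}-\mu_{-1})^{T}\Sigma^{-1}x
     \;-\; \tfrac{1}{2}\bigl(\mu_{+1}^{T}\Sigma^{-1}\mu_{+1}
     - \mu_{-1}^{T}\Sigma^{-1}\mu_{-1}\bigr).
\end{align*}
```

The normalization constants cancel because both classes share $\Sigma$, and so do the quadratic terms $x^{T}\Sigma^{-1}x$. What remains is linear in x, so setting it equal to any threshold defines a plane with normal vector $\Sigma^{-1}(\mu_{+1}-\mu_{-1})$.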
Generative vs. Discriminant. The Generative approach will attempt to estimate the Gaussian distributions from data and then derive the decision rule. The Discriminant approach will seek to estimate the decision rule directly by learning the discriminant plane. In practice, we will not know the form of the distributions or the form of the discriminant.
Gaussian. Gaussian Case with unequal covariance.
Discriminative Models & Features. In practice, the Discriminative methods are usually defined based on features extracted from the data. (E.g. length and brightness of fish). Calculate features z=h(x). Bayes Decision Theory says that this throws away information. Restrict to a sub-class of possible decision rules – those that can be expressed in terms of features z=h(x).
Bayes Decision Rule and Learning. Bayes Decision Theory assumes that we know, or can learn, the distributions P(x|y). This is often not practical, or extremely difficult. In real problems, you have a set of classified data $\{(x_i, y_i) : i = 1, \dots, N\}$. You can attempt to learn P(x|y=+1) & P(x|y=-1) from these (next few lectures). Parametric & Non-parametric approaches. Question: when do you have enough data to learn these probabilities accurately? Depends on the complexity of the model.
Machine Learning. Replace Risk by Empirical Risk: $R_{emp}(f) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i), y_i)$, the average loss over the N training samples $(x_i, y_i)$. How does minimizing the empirical risk relate to minimizing the true risk? Key Issue: When can we generalize? Can we be confident that the decision rule we have learnt on the training data will yield good results on unseen data?
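Empirical risk minimization can be sketched for a very small hypothesis class. The example below assumes one-dimensional inputs, the 0-1 loss, and threshold rules of the form "+1 if x > t"; the data set, the candidate thresholds, and all function names are hypothetical.

```python
def empirical_risk(f, data, loss):
    """Average loss of decision rule f over labelled samples (x_i, y_i)."""
    return sum(loss(f(x), y) for x, y in data) / len(data)


def zero_one_loss(decision, y):
    """Penalty 1 for a misclassification, 0 otherwise."""
    return 0.0 if decision == y else 1.0


def threshold_rule(t):
    """Decision rule f(x) = +1 if x > t, else -1."""
    return lambda x: +1 if x > t else -1


# Made-up training set of (x_i, y_i) pairs.
data = [(1.0, -1), (2.0, -1), (2.5, +1), (3.0, -1), (4.0, +1), (5.0, +1)]
candidates = [1.5, 2.25, 2.75, 3.5, 4.5]

# Pick the candidate threshold with the smallest empirical risk.
best_t = min(
    candidates,
    key=lambda t: empirical_risk(threshold_rule(t), data, zero_one_loss),
)
print(best_t, empirical_risk(threshold_rule(best_t), data, zero_one_loss))
```

The selected threshold minimizes the error on these six samples only; whether it also does well on new samples is exactly the generalization question raised above.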
Machine Learning. Vapnik’s theory gives a mathematically elegant way of answering these issues. It assumes that the data is sampled from an unknown distribution. Vapnik’s theory gives bounds for when we can generalize. Unfortunately these bounds are very conservative. In practice, train on part of the dataset and test on the other part(s).
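The train/test practice can be sketched as a simple holdout split. Everything in this example is an assumption for illustration: the synthetic one-dimensional data, the 70/30 split, the threshold-rule family, and the helper names `holdout_split` and `error_rate`.

```python
import random


def holdout_split(data, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and split it into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


def error_rate(rule, data):
    """Fraction of samples (x, y) that the rule misclassifies."""
    return sum(1 for x, y in data if rule(x) != y) / len(data)


# Synthetic data: x drawn from a unit-variance Gaussian centred at +1 or -1.
rng = random.Random(1)
data = [(rng.gauss(1.0 if y == +1 else -1.0, 1.0), y) for y in [+1, -1] * 100]

train, test = holdout_split(data)

# Fit the threshold on the training part only.
candidates = sorted(x for x, _ in train)
best_t = min(
    candidates,
    key=lambda t: error_rate(lambda x: +1 if x > t else -1, train),
)


def rule(x):
    """Learnt decision rule: +1 above the fitted threshold, -1 otherwise."""
    return +1 if x > best_t else -1


print("train error:", error_rate(rule, train))
print("test error:", error_rate(rule, test))
```

The test error is computed on samples that played no part in choosing the threshold, so it gives an estimate of performance on unseen data that the training error cannot provide.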
Extensions to Multiple Classes. The decision rule partitions the feature space into k decision regions. Conceptually straightforward; see Duda, Hart & Stork.
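For k classes the pointwise minimization of expected loss carries over unchanged. The block below is a sketch in assumed notation, with classes labelled $1, \dots, k$ and $\Omega_i$ denoting the region of inputs assigned to class i.

```latex
\begin{align*}
f^*(x) &= \arg\min_{i \in \{1,\dots,k\}} \sum_{j=1}^{k} L(i,j)\, P(y=j \mid x), \\
\Omega_i &= \{\, x : f^*(x) = i \,\}, \qquad i = 1,\dots,k .
\end{align*}
```

With the 0-1 loss the inner sum equals $1 - P(y=i \mid x)$, so the rule reduces to choosing the class with the largest posterior, $f^*(x) = \arg\max_i P(y=i \mid x)$.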