
Bayesian Decision Theory



Presentation on theme: "Bayesian Decision Theory"— Presentation transcript:

1 Bayesian Decision Theory
Z. Ghassabi

2 Outline What is pattern recognition? Need for Probabilistic Reasoning
What is classification? Need for Probabilistic Reasoning Probabilistic Classification Theory What is Bayesian Decision Theory? HISTORY PRIOR PROBABILITIES CLASS-CONDITIONAL PROBABILITIES BAYES FORMULA A Casual Formulation Decision

3 Outline What is Bayesian Decision Theory? Decision for Two Categories

4 Outline What is classification?
Classification by Bayesian Classification Basic Concepts Bayes Rule More General Forms of Bayes Rule Discriminant Functions Bayesian Belief Networks

5 IMAGE PROCESSING EXAMPLE
What is pattern recognition? TYPICAL APPLICATIONS OF PR IMAGE PROCESSING EXAMPLE Sorting Fish: incoming fish are sorted according to species using optical sensing (sea bass or salmon?) Feature Extraction Segmentation Sensing Problem Analysis: set up a camera and take some sample images to extract features. Consider features such as length, lightness, width, number and shape of fins, position of mouth, etc. Suppose that a fish packing plant wants to automate the process of sorting incoming fish on a conveyor belt according to species. As a pilot project it is decided to try to separate sea bass from salmon using optical sensing. We set up a camera, take some sample images, and begin to note some physical differences between the two types of fish (length, lightness, width, number and shape of fins, position of the mouth, and so on), and these suggest features to explore for use in our classifier. First the camera captures an image of the fish. Next, the camera's signals are preprocessed to simplify subsequent operations without losing relevant information. In particular, we might use a segmentation operation in which the images of different fish are somehow isolated from one another and from the background. The information from a single fish is then sent to a feature extractor, whose purpose is to reduce the data by measuring certain "features" or "properties." These features (or, more precisely, the values of these features) are then passed to a classifier that evaluates the evidence presented and makes a final decision as to the species.

6 Pattern Classification System
Preprocessing Segment (isolate) fishes from one another and from the background Feature Extraction Reduce the data by measuring certain features Classification Divide the feature space into decision regions

7

8 Classification Initially use the length of the fish as a possible feature for discrimination

9 LENGTH AS A DISCRIMINATOR
TYPICAL APPLICATIONS LENGTH AS A DISCRIMINATOR To choose the threshold l* we could obtain some design or training samples of the different types of fish, (somehow) make length measurements, and inspect the results. Length is a poor discriminator.

10 Feature Selection The length is a poor feature alone!
Select the lightness as a possible feature

11 TYPICAL APPLICATIONS ADD ANOTHER FEATURE
Now we are very careful to eliminate variations in illumination, since they can only obscure the models and corrupt our new classifier. For instance, as a fish packing company we may know that our customers easily accept occasional pieces of tasty salmon in their cans labeled "sea bass," but they object vigorously if a piece of sea bass appears in their cans labeled "salmon." If we want to stay in business, we should adjust our decision boundary to avoid antagonizing our customers, even if it means that more salmon makes its way into the cans of sea bass. In this case, then, we should move our decision boundary x* to smaller values of lightness, thereby reducing the number of sea bass that are classified as salmon (Fig. 1.3). The more our customers object to getting sea bass with their salmon, i.e., the more costly this type of error, the lower we should set the decision threshold x* in Fig. 1.3. Such considerations suggest that there is an overall single cost associated with our decision, and our true task is to make a decision rule (i.e., set a decision boundary) so as to minimize such a cost. This is the central task of decision theory, of which pattern classification is perhaps the most important subfield. Lightness is a better feature than length because it reduces the misclassification error. Can we combine features in such a way that we improve performance? (Hint: correlation)

12 Task of decision theory
Threshold decision boundary and cost relationship: move the decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!) Task of decision theory

13 Feature Vector Adopt the lightness and add the width of the fish to the feature vector: Fish xT = [x1, x2] (x1 = lightness, x2 = width). If we ignore how these features might be measured in practice, we realize that the feature extractor has thus reduced the image of each fish to a point or feature vector x in a two-dimensional feature space.

14 Straight line decision boundary
TYPICAL APPLICATIONS WIDTH AND LIGHTNESS Straight line decision boundary Treat features as an N-tuple (two-dimensional vector) Create a scatter plot Draw a line (regression) separating the two classes Classify the fish as sea bass if its feature vector falls above the decision boundary shown, and as salmon otherwise. This rule appears to do a good job of separating our samples and suggests that perhaps incorporating yet more features would be desirable. Besides the lightness and width of the fish, we might include some shape parameter, such as the vertex angle of the dorsal fin, or the placement of the eyes (expressed as a proportion of the mouth-to-tail distance), and so on. How do we know beforehand which of these features will work best? Some features might be redundant: for instance, if the eye color of all fish correlated perfectly with width, then classification performance need not be improved if we also include eye color as a feature. Even if the difficulty or computational cost of attaining more features is of no concern, might we ever have too many features?

15 Features We might add other features that are not highly correlated with the ones we already have. Be sure not to reduce the performance by adding "noisy features." Ideally, you might think the best decision boundary is the one that provides optimal performance on the training data (see the following figure). Suppose that other features are too expensive or difficult to measure, or provide little improvement (or possibly even degrade the performance) in the approach described above, and that we are forced to make our decision based on the two features in Fig. 1.4. If our models were extremely complicated, our classifier would have a decision boundary more complex than the simple straight line.

16 Is this a good decision boundary?
TYPICAL APPLICATIONS DECISION THEORY Can we do better than a linear classifier? Figure 1.5: Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be misclassified as a sea bass. With such a "solution," though, our satisfaction would be premature because the central aim of designing a classifier is to suggest actions when presented with novel patterns, i.e., fish not yet seen. This is the issue of generalization. It is unlikely that the complex decision boundary in Fig. 1.5 would provide good generalization, since it seems to be "tuned" to the particular training samples, rather than some underlying characteristics or true model of all the sea bass and salmon that will have to be separated. Rather, then, we might seek to "simplify" the recognizer, motivated by a belief that the underlying models will not require a decision boundary that is as complex as that in Fig. 1.5. Indeed, we might be satisfied with slightly poorer performance on the training samples if it means that our classifier will have better performance on novel patterns. But if designing a very complex recognizer is unlikely to give good generalization, precisely how should we quantify and favor simpler classifiers? How would our system automatically determine that the simple curve in Fig. 1.6 is preferable to the manifestly simpler straight line in Fig. 1.4 or the complicated boundary in Fig. 1.5? Assuming that we somehow manage to optimize this tradeoff, can we then predict how well our system will generalize to new patterns? These are some of the central problems in statistical pattern recognition. Is this a good decision boundary? What is wrong with this decision surface? (hint: generalization)

17 Issue of generalization!
Decision Boundary Choice Our satisfaction is premature because the central aim of designing a classifier is to correctly classify new (test) input Issue of generalization!

18 GENERALIZATION AND RISK
TYPICAL APPLICATIONS GENERALIZATION AND RISK Better decision boundary Why might a smoother decision surface be a better choice? (hint: Occam's Razor). Figure 1.6: The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of the classifier. This, too, should give us added appreciation of the ability of humans to switch rapidly and fluidly between pattern recognition tasks. We might wish to favor a small number of features, which might lead to simpler decision regions and a classifier that is easier to train. We might also wish to have features that are robust, i.e., relatively insensitive to noise or other errors. In practical applications we may need the classifier to act quickly, or use few electronic components, memory, or processing steps. PR investigates how to find such "optimal" decision surfaces and how to provide system designers with the tools to make intelligent trade-offs.

19 Need for Probabilistic Reasoning
Most everyday reasoning is based on uncertain evidence and inferences. Classical logic, which only allows conclusions to be strictly true or strictly false, does not account for this uncertainty or the need to weigh and combine conflicting evidence. Today's expert systems employ fairly ad hoc methods for reasoning under uncertainty and for combining evidence.

20 Probabilistic Classification Theory
In practice, some overlap between classes and random variation within classes occur; hence perfect separation between classes cannot be achieved, and misclassification may occur. Bayesian decision theory allows us to introduce a cost function for the case in which an input vector that belongs to class A is mistakenly assigned to class B (misclassification).

21 What is Bayesian Decision Theory?
HISTORY Bayesian probability was named after Reverend Thomas Bayes (1701–1761). He proved a special case of what is currently known as Bayes' Theorem. The term "Bayesian" came into use around the 1950s. Bayes set out his theory of probability in "Essay towards solving a problem in the doctrine of chances," published in the Philosophical Transactions of the Royal Society of London in 1764. The paper was sent to the Royal Society by Richard Price, a friend of Bayes', who wrote: "I now send you an essay which I have found among the papers of our deceased friend Mr Bayes, and which, in my opinion, has great merit... In an introduction which he has writ to this Essay, he says, that his design at first in thinking on the subject of it was, to find out a method by which we might judge concerning the probability that an event has to happen, in given circumstances, upon supposition that we know nothing concerning it but that, under the same circumstances, it has happened a certain number of times, and failed a certain other number of times."

22 HISTORY (Cont.) Pierre-Simon, Marquis de Laplace (1749–1827) independently proved a generalized version of Bayes' Theorem. The ideas proposed above were not fully developed until later. Bayesian Belief Networks emerged at Stanford University in the 1970s (Judea Pearl, 1988) and became popular in the 1990s. Bayesian decision theory is a fundamental statistical approach to statistical pattern classification.

23 HISTORY (Cont.) Current uses of Bayesian Networks:
Microsoft's printer troubleshooter. Diagnosing diseases (Mycin). Predicting oil and stock prices. Controlling the space shuttle. Risk analysis: schedule and cost overruns.

24 PROBABILISTIC DECISION THEORY
BAYESIAN DECISION THEORY PROBABILISTIC DECISION THEORY Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. It uses a probabilistic approach to help make decisions (e.g., classification) so as to minimize the risk (cost). Assume all relevant probability distributions are known (later we will learn how to estimate these from data). In statistical pattern recognition we focus on the statistical properties of the patterns (generally expressed in probability densities), and this will command most of our attention in this book. These prior probabilities reflect our prior knowledge of how likely we are to get a sea bass or salmon before the fish actually appears. It might, for instance, depend upon the time of year or the choice of fishing area.

25 BAYESIAN DECISION THEORY
PRIOR PROBABILITIES The state of nature is prior information. Let ω denote the state of nature, modeled as a random variable: ω = ω1 is the event that the next fish is a sea bass. Category ω1: sea bass; category ω2: salmon. A priori probabilities: P(ω1) = probability of category 1; P(ω2) = probability of category 2; P(ω1) + P(ω2) = 1 (either ω1 or ω2 must occur). Decision rule: decide ω1 if P(ω1) > P(ω2); otherwise, decide ω2. Random variables: a random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables, discrete and continuous. Examples of discrete random variables include the number of children in a family, the Friday night attendance at a cinema, the number of patients in a doctor's surgery, and the number of defective light bulbs in a box of ten. The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values; it is also called the probability function or the probability mass function. In general, the mathematical function describing the possible values of a random variable and their associated probabilities is known as a probability distribution. This rule makes sense if we are to judge just one fish, but if we are to judge many fish, using this rule repeatedly may seem a bit strange: we know there will be many mistakes.

26 CLASS-CONDITIONAL PROBABILITIES
BAYESIAN DECISION THEORY CLASS-CONDITIONAL PROBABILITIES A decision rule with only prior information always produces the same result and ignores measurements. If P(ω1) >> P(ω2), we will be correct most of the time. Given a feature x (lightness), which is a continuous random variable, p(x|ωj) is the class-conditional probability density function: pdfs show the probability of measuring a particular feature value given category ωj. If x is a feature value, the two curves describe the difference in populations of the two classes. Density functions are normalized, so the area under each curve is 1.0. In most circumstances we are not asked to make decisions with so little information. In our example, we might for instance use a lightness measurement x to improve our classifier. Different fish will yield different lightness readings, and we express this variability in probabilistic terms; the difference between p(x|ω1) and p(x|ω2) then describes the difference in lightness between the populations of sea bass and salmon.

27 Let x be a continuous random variable.
p(x|ω) is the probability density for x given the state of nature ω: p(lightness | salmon)? p(lightness | sea bass)? Suppose that we know both the prior probabilities P(ωj) and the conditional densities p(x|ωj). Suppose further that we measure the lightness of a fish and discover that its value is x. How does this measurement influence our attitude concerning the true state of nature, that is, the category of the fish?

28 BAYESIAN DECISION THEORY
BAYES FORMULA How do we combine a priori and class-conditional probabilities to know the probability of a state of nature? Suppose we know both P(ωj) and p(x|ωj), and we can measure x. How does this influence our decision? The joint probability of finding a pattern that is in category ωj and that has feature value x is: p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj). Rearranging terms, we arrive at Bayes formula: P(ωj|x) = p(x|ωj) P(ωj) / p(x).

29 POSTERIOR PROBABILITIES
A Casual Formulation The prior probability reflects knowledge of the relative frequency of instances of a class. The likelihood is a measure of the probability that a measurement value occurs in a class. The evidence is a scaling term. BAYESIAN DECISION THEORY POSTERIOR PROBABILITIES Bayes formula can be expressed in words as: posterior = (likelihood × prior) / evidence. By measuring x, we can convert the prior probability P(ωj) into a posterior probability P(ωj|x). The evidence can be viewed as a scale factor and is often ignored in optimization applications (e.g., speech recognition). For two categories, the Bayes decision is: choose ω1 if P(ω1|x) > P(ω2|x); otherwise choose ω2.

30 BAYES THEOREM A special case of Bayes' Theorem:
P(A∩B) = P(B) × P(A|B) and P(B∩A) = P(A) × P(B|A). Since P(A∩B) = P(B∩A), we have P(B) × P(A|B) = P(A) × P(B|A), and therefore P(A|B) = [P(A) × P(B|A)] / P(B).
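The identity above can be checked numerically. The priors and conditional probabilities below are hypothetical numbers chosen for illustration, not values from the slides; this is a minimal sketch of Bayes' rule for two events.

```python
# Hypothetical priors for two mutually exclusive categories A1, A2
p_a1, p_a2 = 0.6, 0.4                      # P(A1) + P(A2) = 1
p_b_given_a1, p_b_given_a2 = 0.2, 0.5      # P(B|A1), P(B|A2)

# Evidence via total probability: P(B) = P(B|A1)P(A1) + P(B|A2)P(A2)
p_b = p_b_given_a1 * p_a1 + p_b_given_a2 * p_a2

# Posteriors via Bayes' rule: P(Ai|B) = P(B|Ai) P(Ai) / P(B)
p_a1_given_b = p_b_given_a1 * p_a1 / p_b
p_a2_given_b = p_b_given_a2 * p_a2 / p_b
```

With these numbers the evidence is 0.32 and the posteriors are 0.375 and 0.625; observing B shifts belief toward A2 even though A1 has the larger prior.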

31 Preliminaries and Notations
ω: a state of nature; P(ω): prior probability; x: feature vector; p(x|ω): class-conditional density; P(ω|x): posterior probability.

32 Decision Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.
The evidence p(x) is a scale factor that assures the conditional probabilities sum to 1: P(ω1|x) + P(ω2|x) = 1. We can eliminate the scale factor (which appears on both sides of the equation): decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) for all j ≠ i. Special cases: 1. P(ω1) = P(ω2) = … = P(ωc): the decision is based entirely on the likelihood p(x|ωj). 2. p(x|ω1) = p(x|ω2) = … = p(x|ωc): x gives us no useful information, and the decision is based entirely on the priors.

33 Two Categories
Decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i; equivalently, decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) for all j ≠ i. For two categories: decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2. Equivalently, decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2. Special cases: 1. P(ω1) = P(ω2): decide ω1 if p(x|ω1) > p(x|ω2); otherwise decide ω2. 2. p(x|ω1) = p(x|ω2): decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.

34 Example: decision regions R1 and R2 with P(ω1) = P(ω2).

35 Example: decision regions R1 and R2 with P(ω1) = 2/3, P(ω2) = 1/3.
Bayes decision rule: decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2.

36 POSTERIOR PROBABILITIES
BAYESIAN DECISION THEORY POSTERIOR PROBABILITIES Two-class fish sorting problem (P(ω1) = 2/3, P(ω2) = 1/3): for every value of x, the posteriors sum to 1.0. At x = 14, the probability that the fish is in category ω2 is 0.08, and in category ω1 it is 0.92.

37 BAYESIAN DECISION THEORY
BAYES DECISION RULE Classification error. Decision rule: for an observation x, decide ω1 if P(ω1|x) > P(ω2|x); otherwise, decide ω2. Probability of error for two categories: P(error|x) = P(ω1|x) if we decide ω2, and P(ω2|x) if we decide ω1; under the Bayes rule, P(error|x) = min[P(ω1|x), P(ω2|x)]. The average probability of error is given by: P(error) = ∫ P(error|x) p(x) dx. If for every x we ensure that P(error|x) is as small as possible, then the integral is as small as possible. Thus, the Bayes decision rule minimizes P(error).

38 GENERALIZATION OF THE TWO-CLASS PROBLEM
CONTINUOUS FEATURES GENERALIZATION OF THE TWO-CLASS PROBLEM Generalization of the preceding ideas: use of more than one feature (e.g., length and lightness); use of more than two states of nature (e.g., N-way classification); allowing actions other than a decision on the state of nature (e.g., rejection: refusing to take an action when alternatives are close or confidence is low); introducing a loss function more general than the probability of error (e.g., errors are not equally costly). Let us replace the scalar x by the vector x in a d-dimensional Euclidean space, Rd, called the feature space.

39 The Generalization A set of c states of nature (categories) {ω1, …, ωc}
and a set of a possible actions {α1, …, αa}. LOSS FUNCTION λ(αi|ωj): the loss incurred for taking action αi when the true state of nature is ωj. Allowing actions other than classification primarily allows the possibility of rejection, i.e., of refusing to make a decision in close cases; this is a useful option if being indecisive is not too costly. Formally, the loss function states exactly how costly each action is, and is used to convert a probability determination into a decision. The loss can be zero. Risk: we want to minimize the expected loss in making a decision.

40 Examples Ex 1: Fish classification Ex 2: Medical diagnosis
Ex 1: Fish classification. X is the image of a fish; x = (brightness, length, fin count, etc.); ω is our belief of what the fish type is, ω ∈ {"sea bass", "salmon", "trout", etc.}; α is a decision for the fish type, in this case α ∈ {"sea bass", "salmon", "trout", "manual inspection needed", etc.}. Ex 2: Medical diagnosis. X is all the available medical tests and imaging scans that a doctor can order for a patient; x = (blood pressure, glucose level, cough, x-ray, etc.); ω is an illness type, ω ∈ {"flu", "cold", "TB", "pneumonia", "lung cancer", etc.}; α is a decision for treatment, α ∈ {"Tylenol", "hospitalize", "more tests needed", etc.}.

41 Conditional Risk
Given x, the expected loss (risk) associated with taking action αi is: R(αi|x) = Σj λ(αi|ωj) P(ωj|x).
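The conditional-risk sum translates directly into code. A minimal sketch, assuming the loss is given as a table indexed by action and true class; the function names and example numbers are illustrative.

```python
def conditional_risk(loss, posteriors, i):
    """R(a_i|x) = sum_j loss[i][j] * P(w_j|x)."""
    return sum(lam * p for lam, p in zip(loss[i], posteriors))

def bayes_action(loss, posteriors):
    """Choose the action with minimum conditional risk."""
    risks = [conditional_risk(loss, posteriors, i) for i in range(len(loss))]
    return min(range(len(risks)), key=lambda i: risks[i])

# Zero-one loss on two classes; posteriors P(w1|x)=0.7, P(w2|x)=0.3
zero_one = [[0.0, 1.0], [1.0, 0.0]]
best = bayes_action(zero_one, [0.7, 0.3])   # picks action 0 (class w1)
```

Under zero-one loss the conditional risk of action i is 1 − P(ωi|x), so minimizing risk is the same as picking the largest posterior.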

42 0/1 Loss Function: λ(αi|ωj) = 0 if i = j, and 1 otherwise.

43 Decision Bayesian Decision Rule:
A general decision rule is a function α(x) that tells us which action to take for every possible observation. Bayesian decision rule: for each x, choose the action αi that minimizes the conditional risk R(αi|x). We shall now show that this Bayes decision procedure actually provides the optimal performance on the overall risk.

44 Overall Risk Bayesian decision rule:
The overall risk is given by: R = ∫ R(α(x)|x) p(x) dx. Compute the conditional risk for every action and select the action that minimizes R(αi|x). The resulting minimum overall risk is denoted R* and is referred to as the Bayes risk; it is the best performance that can be achieved. Decision function: if we choose α(x) so that R(α(x)|x) is as small as possible for every x, then the overall risk will be minimized. The Bayesian decision rule is thus the optimal rule for minimizing the overall risk, and its resulting overall risk is called the Bayes risk.

45 Two-Category Classification
Let α1 correspond to ω1, α2 to ω2, and λij = λ(αi|ωj). The conditional risks are: R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x) and R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x).

46 Two-Category Classification
Our decision rule is: choose ω1 if R(α1|x) < R(α2|x); otherwise decide ω2. Equivalently, perform α1 if R(α2|x) > R(α1|x); otherwise perform α2.

47 Two-Category Classification
Perform α1 if R(α2|x) > R(α1|x), i.e., if (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x). Posterior probabilities are scaled before comparison. If the loss incurred for making an error is greater than that incurred for being correct, the factors (λ21 − λ11) and (λ12 − λ22) are positive, and the ratio of these factors simply scales the posteriors.

48 Two-Category Classification
irrelevant Perform 1 if R(2|x) > R(1|x); otherwise perform 2 Thus the Bayes decision rule can be interpreted ratio as calling for deciding ω1 if the likelihood ratio exceeds a threshold value that is independent of the observation x. By employing Bayes formula, we can replace the posteriors by the prior probabilities and conditional densities:

49 Two-Category Classification
This slide will be recalled later. Stated as: choose α1 if the likelihood ratio exceeds a threshold value independent of the observation x. Threshold and likelihood ratio: perform α1 if p(x|ω1)/p(x|ω2) > [(λ12 − λ22)/(λ21 − λ11)] × P(ω2)/P(ω1).
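The likelihood-ratio form of the rule can be sketched as a small function. The function name and the example values are illustrative; the threshold formula is the one stated above.

```python
def likelihood_ratio_decide(p1, p2, P1, P2, l11, l12, l21, l22):
    """Decide w1 iff p(x|w1)/p(x|w2) exceeds the fixed threshold
    [(l12 - l22)/(l21 - l11)] * P(w2)/P(w1)."""
    theta = (l12 - l22) / (l21 - l11) * (P2 / P1)
    return 1 if p1 / p2 > theta else 2

# Zero-one loss and equal priors: the threshold reduces to 1
d_equal = likelihood_ratio_decide(0.3, 0.2, 0.5, 0.5, 0.0, 1.0, 1.0, 0.0)
# Heavier penalty l21 for mislabeling w1 lowers the threshold for deciding w1
d_asym = likelihood_ratio_decide(0.2, 0.3, 0.5, 0.5, 0.0, 1.0, 4.0, 0.0)
```

With zero-one loss and equal priors, the threshold is 1 and the rule is the plain likelihood comparison; raising λ21 to 4 drops the threshold to 0.25, so even a ratio of 2/3 now favors ω1.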

50

51 Loss Factors If the loss factors are identical and the prior probabilities are equal, this reduces to a standard likelihood ratio test: decide ω1 if p(x|ω1)/p(x|ω2) > 1.

52 The conditional risk is the average probability of error.
MINIMUM ERROR RATE The error rate (the probability of error) is to be minimized. Consider a symmetrical or zero-one loss function: λ(αi|ωj) = 0 if i = j, and 1 otherwise. The conditional risk is then: R(αi|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x). This loss function assigns no loss to a correct decision and a unit loss to any error; thus, all errors are equally costly. The conditional risk is the average probability of error. To minimize error, maximize P(ωi|x); this is also known as maximum a posteriori (MAP) decoding.

53 Minimum error rate classification:
LIKELIHOOD RATIO Minimum error rate classification: choose ωi if P(ωi|x) > P(ωj|x) for all j ≠ i. Notice that this leads to the same decision boundaries as in Fig. 2.2, as it must. If we penalize mistakes in classifying ω1 patterns as ω2 more than the converse (i.e., λ21 > λ12), then Eq. 17 leads to the threshold θb marked. Note that the range of x values for which we classify a pattern as ω1 gets smaller, as it should. Figure 2.3: The likelihood ratio p(x|ω1)/p(x|ω2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω2 as ω1 patterns more than the converse (i.e., λ12 > λ21), we get the larger threshold θb, and hence R1 becomes smaller.

54 Example
For the sea bass population, the lightness x is a normal random variable distributed according to N(4,1); for the salmon population, x is distributed according to N(10,1). Select the optimal decision where: P(sea bass) = 2 × P(salmon); the cost of classifying a fish as salmon when it truly is sea bass is $2; and the cost of classifying a fish as sea bass when it is truly salmon is $1.
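This example can be solved numerically with the likelihood-ratio rule from the earlier slide. A minimal sketch, assuming ω1 = sea bass with P(ω1) = 2/3 and ω2 = salmon with P(ω2) = 1/3; the function names are illustrative.

```python
import math

def gauss(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

prior_sb, prior_sal = 2 / 3, 1 / 3     # P(sea bass) = 2 * P(salmon)
loss_sal_when_sb = 2.0                 # decide salmon, truth sea bass: $2
loss_sb_when_sal = 1.0                 # decide sea bass, truth salmon: $1

def decide(x):
    # Decide "sea bass" iff the likelihood ratio p(x|sb)/p(x|sal)
    # exceeds (loss_sb_when_sal / loss_sal_when_sb) * P(sal)/P(sb)
    ratio = gauss(x, 4, 1) / gauss(x, 10, 1)
    threshold = (loss_sb_when_sal / loss_sal_when_sb) * (prior_sal / prior_sb)
    return "sea bass" if ratio > threshold else "salmon"

# Here the ratio is exp(42 - 6x) and the threshold is 1/4, so the
# decision boundary solves exp(42 - 6x) = 1/4: x* = (42 + ln 4) / 6
x_star = (42 + math.log(4)) / 6
```

The boundary lands near x* ≈ 7.23, above the midpoint 7 of the two means: the higher prior and the higher misclassification cost for sea bass both enlarge the sea-bass region.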

55

56

57

58 End of Section 1

59 The Multicategory Classification
How to define discriminant functions? How do we represent pattern classifiers? The most common way is through discriminant functions. Recall that we use {ω1, ω2, …, ωc} for the possible states of nature. For each class we create a discriminant function gi(x); the gi(x) are called the discriminant functions. Our classifier is a network or machine that computes the c discriminant functions g1(x), g2(x), …, gc(x) and takes the corresponding action (e.g., classification) α(x): assign x to ωi if gi(x) > gj(x) for all j ≠ i.
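A discriminant-function classifier can be sketched directly from this rule. The three univariate Gaussian class models and equal priors below are hypothetical choices for illustration; the discriminant used is the minimum-error-rate form g_i(x) = ln p(x|ω_i) + ln P(ω_i).

```python
import math

def make_discriminant(mu, sigma, prior):
    """Return g_i(x) = ln p(x|w_i) + ln P(w_i) for a univariate normal model."""
    def g(x):
        return (-0.5 * ((x - mu) / sigma) ** 2
                - math.log(sigma * math.sqrt(2 * math.pi))
                + math.log(prior))
    return g

# One discriminant per class (hypothetical means, equal priors)
discriminants = [
    make_discriminant(0.0, 1.0, 1 / 3),
    make_discriminant(3.0, 1.0, 1 / 3),
    make_discriminant(6.0, 1.0, 1 / 3),
]

def classify(x):
    # Assign x to w_i if g_i(x) > g_j(x) for all j != i
    scores = [g(x) for g in discriminants]
    return scores.index(max(scores))
```

With equal priors and equal variances this reduces to nearest-mean classification, so an observation near 0 goes to class 0, near 3 to class 1, and so on.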

60 Simple Discriminant Functions
If f(·) is a monotonically increasing function, then the f(gi(·)) are also discriminant functions: the decision is the same if we replace every gi(x) by f(gi(x)). Minimum-risk case: gi(x) = −R(αi|x). Minimum-error-rate case: gi(x) = P(ωi|x), or equivalently gi(x) = p(x|ωi)P(ωi) or gi(x) = ln p(x|ωi) + ln P(ωi).

61 Figure 2.5

62 Decision Regions
The net effect is to divide the feature space into c regions (one for each class). We then have c decision regions separated by decision boundaries. Two-category example: the two decision regions are separated by a decision boundary.

63 Figure 2.6

64 Bayesian Decision Theory (Classification)
The Normal Distribution

65 Basics of Probability Discrete random variable (X), assumed integer-valued:
probability mass function (pmf) P(X = x); cumulative distribution function (cdf) F(x) = P(X ≤ x). Continuous random variable (X): probability density function (pdf) p(x), which is not itself a probability; cumulative distribution function (cdf) F(x) = P(X ≤ x) = ∫_{−∞}^{x} p(t) dt.

66 Probability mass function
The graph of a probability mass function. All the values of this function must be non-negative and sum up to 1.

67 Probability density function
The probability that X falls in an interval is calculated by integrating the pdf f(x) over that interval. For example, the probability of the variable X being within the interval [4.3, 7.8] is P(4.3 ≤ X ≤ 7.8) = ∫_{4.3}^{7.8} f(x) dx.
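The interval probability above can be checked numerically against a closed form. A sketch, assuming a hypothetical exponential pdf with rate 0.25 as the stand-in for f(x); the midpoint-rule integrator is generic.

```python
import math

LAM = 0.25  # hypothetical rate of an Exp(0.25) distribution

def f(x):
    """pdf of Exp(rate=0.25): f(x) = lam * exp(-lam x) for x >= 0."""
    return LAM * math.exp(-LAM * x) if x >= 0 else 0.0

# P(4.3 <= X <= 7.8) via the midpoint rule
a, b, n = 4.3, 7.8, 100_000
h = (b - a) / n
prob = sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Closed form for comparison: F(b) - F(a) = exp(-lam a) - exp(-lam b)
exact = math.exp(-LAM * a) - math.exp(-LAM * b)
```

The numerical integral agrees with the closed form to many digits, which is a quick way to sanity-check any hand-derived cdf.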

68 Expectations Let g be a function of the random variable X:
E[g(X)] = ∫ g(x) p(x) dx (a sum in the discrete case). The kth moment: E[X^k]. The 1st moment: the mean, μ = E[X]. The kth central moment: E[(X − μ)^k].

69 Important Expectations
Mean: μ = E[X]. Variance: Var[X] = E[(X − μ)²]. Fact: Var[X] = E[X²] − μ².

70 Entropy The entropy H = −∫ p(x) ln p(x) dx measures the fundamental uncertainty in the value of points selected randomly from a distribution.

71 Univariate Gaussian Distribution
X ~ N(μ, σ²): p(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)). Properties: it maximizes the entropy among all densities with a given mean and variance, and it arises as a limit via the central limit theorem. E[X] = μ, Var[X] = σ².

72 Illustration of the central limit theorem
Let x1, x2, …, xn be a sequence of n independent and identically distributed random variables, each having finite expectation µ and variance σ² > 0. The central limit theorem states that as the sample size n increases, the distribution of the sample average of these random variables approaches the normal distribution with mean µ and variance σ²/n, irrespective of the shape of the original distribution. (The figure shows the original probability density function and the densities of the sum of two, three, and four terms.)
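The statement can be illustrated by simulation. A sketch, assuming Uniform(0,1) summands (mean 1/2, variance 1/12); the sample sizes are arbitrary choices.

```python
import random
import statistics

random.seed(0)
n, trials = 50, 2000

# Sample means of n Uniform(0,1) draws; the CLT predicts these are
# approximately N(0.5, (1/12)/n)
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(trials)]

m = statistics.fmean(means)       # should be near 0.5
v = statistics.pvariance(means)   # should be near (1/12)/50 = 1/600
```

The empirical mean and variance of the sample averages match the predicted 0.5 and σ²/n closely even at n = 50, despite the flat (decidedly non-Gaussian) shape of the original distribution.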

73 Random Vectors A d-dimensional random vector X.
Vector mean: μ = E[X]. Covariance matrix: Σ = E[(X − μ)(X − μ)T].

74 Multivariate Gaussian Distribution
X ~ N(μ, Σ) for a d-dimensional random vector X: p(x) = (2π)^(−d/2) |Σ|^(−1/2) exp(−½ (x − μ)T Σ⁻¹ (x − μ)), with E[X] = μ and E[(X − μ)(X − μ)T] = Σ.

75 Properties of N(μ, Σ)
Let X ~ N(μ, Σ) be a d-dimensional random vector and let Y = ATX, where A is a d × k matrix. Then Y ~ N(ATμ, ATΣA).


77 On Parameters of N(μ,Σ) X~N(μ,Σ)

78 More On Covariance Matrix
Σ is symmetric and positive semidefinite, so it admits the eigendecomposition Σ = ΦΛΦT, where Φ is an orthonormal matrix whose columns are the eigenvectors of Σ, and Λ is the diagonal matrix of eigenvalues.

79 Whitening Transform X ~ N(μ, Σ), Y = ATX, Y ~ N(ATμ, ATΣA). Let A = ΦΛ^(−1/2); then ATΣA = Λ^(−1/2)ΦTΦΛΦTΦΛ^(−1/2) = I.
80 Whitening Transform X ~ N(μ, Σ), Y = ATX ~ N(ATμ, ATΣA). With A = ΦΛ^(−1/2), the transform factors into a linear projection (the rotation ΦT) followed by whitening (the scaling Λ^(−1/2)).

81 Whitening Transform The whitening transformation is a decorrelation method that converts the covariance matrix S of a set of samples into the identity matrix I. This effectively creates new random variables that are uncorrelated and have unit variances. The method is called the whitening transform because it transforms the input closer towards white noise. It can be expressed as Y = Λ^(−1/2) ΦT X, where Φ is the matrix with the eigenvectors of S as its columns and Λ is the diagonal matrix of non-increasing eigenvalues.
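The whitening recipe on this slide can be demonstrated on sampled data. A sketch, assuming a hypothetical 2×2 covariance for the input samples; the eigendecomposition and scaling follow the A = ΦΛ^(−1/2) construction above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D samples with a hypothetical covariance matrix
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=50_000)

# Whitening matrix A = Phi Lambda^{-1/2}, from S = Phi Lambda Phi^T
vals, vecs = np.linalg.eigh(np.cov(X.T))
A = vecs @ np.diag(vals ** -0.5)

Y = X @ A                # apply Y = A^T X sample-by-sample
cov_Y = np.cov(Y.T)      # should be (numerically) the identity
```

Because A is built from the sample covariance itself, ATSA = I holds exactly up to floating-point error, so the whitened samples have identity covariance by construction.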

82 White noise White noise is a random signal (or process) with a flat power spectral density: the signal contains equal power within any fixed bandwidth at any center frequency.

83 Mahalanobis Distance For X ~ N(μ, Σ), the contours of constant density are surfaces of constant squared Mahalanobis distance r² = (x − μ)T Σ⁻¹ (x − μ); the size of the contour depends on the value of r².

84 Mahalanobis distance In statistics, the Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables by which different patterns can be identified and analyzed. It is a useful way of determining the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant, i.e., not dependent on the scale of measurements.

85 Mahalanobis distance Formally, the Mahalanobis distance of a multivariate vector x = (x₁, …, x_N)ᵀ from a group of values with mean μ = (μ₁, …, μ_N)ᵀ and covariance matrix Σ is defined as: D_M(x) = √((x − μ)ᵀΣ⁻¹(x − μ)). Mahalanobis distance can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with the covariance matrix Σ: d(x, y) = √((x − y)ᵀΣ⁻¹(x − y)). If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is called the normalized Euclidean distance: d(x, y) = √(Σᵢ (xᵢ − yᵢ)²/σᵢ²), where σᵢ is the standard deviation of xᵢ over the sample set.
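A minimal pure-Python implementation of these definitions (the vectors and covariance matrices below are made-up illustrations):

```python
import math

# Mahalanobis distance d(x, y) = sqrt((x - y)^T Sigma^{-1} (x - y)), 2-D case.

def inv2(S):
    """Inverse of a 2x2 matrix."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]

def mahalanobis(x, y, Sigma):
    d = [x[0] - y[0], x[1] - y[1]]
    Si = inv2(Sigma)
    q = sum(d[i] * Si[i][j] * d[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

x, mu = [4.0, 3.0], [1.0, -1.0]   # difference vector is (3, 4)
# With Sigma = I the Mahalanobis distance reduces to the Euclidean distance:
print(mahalanobis(x, mu, [[1.0, 0.0], [0.0, 1.0]]))   # 5.0
# With a diagonal Sigma it is the normalized Euclidean distance:
print(mahalanobis(x, mu, [[9.0, 0.0], [0.0, 16.0]]))  # sqrt((3/3)^2 + (4/4)^2)
```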

86 Mahalanobis Distance (recap) For X ~ N(μ, Σ), the contours of constant density are hyperellipsoids of constant squared Mahalanobis distance r²; their size depends on the value of r².

87 Bayesian Decision Theory (Classification)
Discriminant Functions for Normal Populations

88 Normal Density If features are statistically independent and the variance is the same for all features, the discriminant function is simple and is linear in nature. A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces are pieces of hyperplanes defined by linear equations.

89 Minimum-Error-Rate Classification
Assuming the measurements are normally distributed, we have Xᵢ ~ N(μᵢ, Σᵢ).

90 Some Algebra to Simplify the Discriminants
Since p(x|ωᵢ) is Gaussian, we take the natural logarithm to re-write the first term:
gᵢ(x) = ln p(x|ωᵢ) + ln P(ωᵢ) = −½(x − μᵢ)ᵀΣᵢ⁻¹(x − μᵢ) − (d/2) ln 2π − ½ ln|Σᵢ| + ln P(ωᵢ)
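The log-discriminant can be sketched directly in code. This is a 2-D illustration with made-up μ, Σ, and prior, not the worked example that appears later in these slides:

```python
import math

# Gaussian log-discriminant:
# g_i(x) = -1/2 (x - mu_i)^T Sigma_i^{-1} (x - mu_i)
#          - d/2 ln(2 pi) - 1/2 ln|Sigma_i| + ln P(omega_i),   d = 2 here.

def g(x, mu, Sigma, prior):
    det = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]
    Si = [[Sigma[1][1] / det, -Sigma[0][1] / det],
          [-Sigma[1][0] / det, Sigma[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(d[i] * Si[i][j] * d[j] for i in range(2) for j in range(2))
    return -0.5 * quad - math.log(2 * math.pi) - 0.5 * math.log(det) + math.log(prior)

mu, Sigma = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
# A point at the class mean scores higher than one a Mahalanobis unit away:
assert g(mu, mu, Sigma, 0.5) > g([1.0, 0.0], mu, Sigma, 0.5)
print(g(mu, mu, Sigma, 0.5))
```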

91 Some Algebra to Simplify the Discriminants (continued)

92 Minimum-Error-Rate Classification
Three Cases: Case 1: Classes are centered at different means, and their feature components are pairwise independent with the same variance. Case 2: Classes are centered at different means, but share the same covariance matrix. Case 3: Arbitrary covariance matrices.

93 Case 1. Σᵢ = σ²I (terms that do not depend on i are irrelevant and can be dropped from gᵢ(x))

94 Case 1. Σᵢ = σ²I

95 Case 1. Σᵢ = σ²I Boundary between ωᵢ and ωⱼ

96 Case 1. Σᵢ = σ²I
The decision boundary between ωᵢ and ωⱼ is a hyperplane perpendicular to the line between the means: wᵀ(x − x₀) = 0, with w = μᵢ − μⱼ. In general x₀ = ½(μᵢ + μⱼ) − (σ²/‖μᵢ − μⱼ‖²) ln(P(ωᵢ)/P(ωⱼ)) (μᵢ − μⱼ), which reduces to the midpoint of the means when P(ωᵢ) = P(ωⱼ).

97 Case 1. Σᵢ = σ²I
When the priors are equal and the support regions are spherical, the decision boundary lies simply halfway between the means (Euclidean distance). This is the minimum distance classifier (template matching).
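A sketch of the resulting minimum distance classifier; the class means below are made-up illustration values:

```python
# Case 1 (Sigma_i = sigma^2 I, equal priors): the Bayes rule reduces to
# assigning x to the class with the nearest (Euclidean) mean.

def classify(x, means):
    """Return the index of the nearest class mean."""
    d2 = [sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for m in means]
    return d2.index(min(d2))

means = [[0.0, 0.0], [4.0, 4.0]]
print(classify([1.0, 1.0], means))  # 0: closer to the first mean
print(classify([3.0, 4.0], means))  # 1: closer to the second mean
```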

98

99 Case 1. Σᵢ = σ²I Note how the priors shift the boundary away from the more likely mean!

100 Case 1. Σᵢ = σ²I

101 Case 1. Σᵢ = σ²I

102 Case 2. Σᵢ = Σ
Covariance matrices are arbitrary, but equal to each other for all classes. Features then form hyperellipsoidal clusters of equal size and shape. The discriminant is based on the squared Mahalanobis distance (x − μᵢ)ᵀΣ⁻¹(x − μᵢ); terms that do not depend on i are irrelevant, and the prior terms drop out when P(ωᵢ) = P(ωⱼ) for all i, j.

103 Case 2. Σᵢ = Σ (terms independent of i are irrelevant)

104 Case 2. Σᵢ = Σ
The discriminant hyperplanes are often not orthogonal to the segments joining the class means.

105 Case 2. Σᵢ = Σ

106 Case 2. Σᵢ = Σ

107 Case 3. Σᵢ ≠ Σⱼ (arbitrary)
The covariance matrices are different for each category. In the two-class case, the decision boundaries are hyperquadrics, e.g., hyperplanes, hyperspheres, hyperellipsoids, or hyperhyperboloids. The quadratic term −½xᵀΣᵢ⁻¹x, which was irrelevant in Cases 1 and 2, must now be kept.

108 Case 3. Σᵢ ≠ Σⱼ Non-simply connected decision regions can arise even in one dimension for Gaussians with unequal variance.

109 Case 3. Σᵢ ≠ Σⱼ

110 Case 3. Σᵢ ≠ Σⱼ

111 Case 3. Σᵢ ≠ Σⱼ

112 Multi-Category Classification

113 Example : A Problem Exemplars (transposed)
For w1 = {(2, 6), (3, 4), (3, 8), (4, 6)} For w2 = {(1, -2), (3, 0), (3, -4), (5, -2)} Calculated means (transposed) m1 = (3, 6) m2 = (3, -2)
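The means quoted on this slide can be recomputed directly from the exemplars:

```python
# Class exemplars from the slide (transposed row vectors).
w1 = [(2, 6), (3, 4), (3, 8), (4, 6)]
w2 = [(1, -2), (3, 0), (3, -4), (5, -2)]

def mean(samples):
    """Component-wise sample mean."""
    n = len(samples)
    return tuple(sum(coords) / n for coords in zip(*samples))

m1, m2 = mean(w1), mean(w2)
print(m1)  # (3.0, 6.0)
print(m2)  # (3.0, -2.0)
```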

114 Example : Covariance Matrices

115 Example : Covariance Matrices

116 Example: Inverse and Determinant for Each of the Covariance Matrices

117 Example : A Discriminant Function for Class 1

118 Example

119 Example : A Discriminant Function for Class 2

120 Example

121 Example : The Class Boundary

122 Example : A Quadratic Separator

123 Example : Plot of the Discriminant

124 Summary Steps for Building a Bayesian Classifier
Collect class exemplars Estimate class a priori probabilities Estimate class means Form covariance matrices, find the inverse and determinant for each Form the discriminant function for each class

125 Using the Classifier Obtain a measurement vector x
Evaluate the discriminant function gᵢ(x) for each class i = 1, …, c. Decide x is in class j if gⱼ(x) > gᵢ(x) for all i ≠ j.
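These two steps can be sketched as follows. The linear discriminants g1 and g2 below are made-up illustrations (the worked example in these slides uses quadratic discriminants):

```python
# Decision rule: assign x to the class j whose discriminant g_j(x) is largest.

def g1(x):
    return 1.0 * x[0] + 0.0 * x[1] - 2.0

def g2(x):
    return 0.0 * x[0] + 1.0 * x[1] - 2.0

def decide(x, discriminants):
    scores = [gi(x) for gi in discriminants]
    return scores.index(max(scores))  # ties go to the lowest index

print(decide([5.0, 1.0], [g1, g2]))  # 0, since g1 = 3.0 > g2 = -1.0
print(decide([0.0, 5.0], [g1, g2]))  # 1, since g2 = 3.0 > g1 = -2.0
```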

126 Bayesian Decision Theory (Classification)
Criterions

127 Bayesian Decision Theory (Classification)
Minimun error rate Criterion

128 Minimum-Error-Rate Classification
If action αᵢ is taken and the true state of nature is ωⱼ, then the decision is correct if i = j and in error if i ≠ j. The error rate (the probability of error) is to be minimized. With the symmetrical (zero-one) loss function, the conditional risk is R(αᵢ|x) = 1 − P(ωᵢ|x).

129 Minimum-Error-Rate Classification

130 Bayesian Decision Theory (Classification)
Minimax Criterion

131 Bayesian Decision Rule: Two-Category Classification
Threshold likelihood ratio: decide ω₁ if p(x|ω₁)/p(x|ω₂) > θ. The minimax criterion deals with the case where the prior probabilities are unknown.
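A sketch of this rule for 1-D Gaussian class densities; the means, variance, and threshold below are made-up illustration values:

```python
import math

# Two-category likelihood-ratio rule: decide omega_1 when
# p(x|omega_1) / p(x|omega_2) exceeds the threshold theta.

def gauss(x, mu, sigma):
    """1-D Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def decide(x, theta=1.0):
    ratio = gauss(x, 0.0, 1.0) / gauss(x, 3.0, 1.0)  # p(x|w1) / p(x|w2)
    return 1 if ratio > theta else 2

print(decide(0.5))  # 1: x is nearer the omega_1 mean
print(decide(2.5))  # 2: x is nearer the omega_2 mean
```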

132 Basic Concept on Minimax
Choose the worst-case prior probabilities (the maximum loss) and then pick the decision rule that minimizes the overall risk. In other words, minimize the maximum possible overall risk, so that the worst risk for any value of the priors is as small as possible.

133 Overall Risk

134 Overall Risk

135 Overall Risk

136 Overall Risk

137 Overall Risk
R(x) = ax + b: the overall risk for a particular P(ω₁) is linear in P(ω₁); the slope a and intercept b depend on the setting of the decision boundary.

138 Overall Risk
R(x) = ax + b: setting the slope a = 0 gives the minimax solution. The risk then equals b = R_mm, the minimax risk, independent of the value of P(ωᵢ).

139 Minimax Risk

140 Error Probability Use 0/1 loss function

141 Minimax Error-Probability
Use the 0/1 loss function; the two conditional error probabilities are P(ω₁|ω₂) and P(ω₂|ω₁).
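For the 0/1 loss case, the minimax boundary is the one that equalizes the two conditional errors, making the risk independent of the priors. This can be checked numerically with made-up 1-D Gaussian class densities of equal variance, for which the equal-error boundary is the midpoint of the means:

```python
import math

# Minimax with 0/1 loss: pick the boundary x* where
# P(decide omega_2 | omega_1) = P(decide omega_1 | omega_2).

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu1, mu2, sigma = 0.0, 3.0, 1.0     # illustration parameters
x_star = (mu1 + mu2) / 2.0          # midpoint = equal-error boundary here

p_err_given_1 = 1.0 - norm_cdf((x_star - mu1) / sigma)  # P(x > x* | omega_1)
p_err_given_2 = norm_cdf((x_star - mu2) / sigma)        # P(x < x* | omega_2)

assert abs(p_err_given_1 - p_err_given_2) < 1e-12  # risk independent of priors
print(p_err_given_1)  # ~0.0668 for these parameters
```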

142 Minimax Error-Probability
1 2

143 Minimax Error-Probability

144 Bayesian Decision Theory (Classification)
Neyman-Pearson Criterion

145 Bayesian Decision Rule: Two-Category Classification
Threshold likelihood ratio: decide ω₁ if p(x|ω₁)/p(x|ω₂) > θ. The Neyman-Pearson criterion deals with the case where both the loss functions and the prior probabilities are unknown.

146 Signal Detection Theory
Signal detection theory evolved from the development of communications and radar equipment in the first half of the last century. It migrated to psychology, initially as part of sensation and perception research, in the 1950s and 60s, as an attempt to understand features of human behavior when detecting very faint stimuli that were not being explained by traditional threshold theories.

147 The situation of interest
A person is faced with a stimulus (signal) that is very faint or confusing. The person must make a decision: is the signal there or not? What makes this situation confusing and difficult is the presence of other activity that is similar to the signal. Let us call this activity noise.

148 Example Noise is present both in the environment and in the sensory system of the observer. The observer reacts to the momentary total activation of the sensory system, which fluctuates from moment to moment, as well as responding to environmental stimuli, which may include a signal.

149 Signal Detection Theory
Suppose we want to detect a single pulse from a signal, which we assume carries some random noise. When the signal is present we observe a normal distribution with mean u2; when the signal is not present we observe a normal distribution with mean u1. We assume the same standard deviation σ. Can we measure the discriminability of the problem, and can we do so independently of the threshold x*? Discriminability: d′ = |u2 − u1| / σ.

150 Example A radiologist is examining a CT scan, looking for evidence of a tumor. This is a hard job, because there is always some uncertainty. There are four possible outcomes: hit (tumor present and doctor says "yes"), miss (tumor present and doctor says "no"), false alarm (tumor absent and doctor says "yes"), and correct rejection (tumor absent and doctor says "no"). Two of these are types of error.

151 The Four Cases
Signal detection theory was developed to help us understand how a continuous and ambiguous signal can lead to a binary yes/no decision.
Decision \ Signal (tumor): Absent (ω₁) | Present (ω₂)
No (α₁): Correct Rejection P(α₁|ω₁) | Miss P(α₁|ω₂)
Yes (α₂): False Alarm P(α₂|ω₁) | Hit P(α₂|ω₂)

152 Decision Making
Discriminability d′ and the criterion (decision bias, based on expectancy). (Figure: overlapping Noise (ω₁) and Noise + Signal (ω₂) distributions; the area beyond the criterion gives the Hit rate P(α₂|ω₂) and the False Alarm rate P(α₂|ω₁).)

153

154 Signal Detection Theory
How do we find d′ if we do not know u1, u2, or x*? From the data we can compute: P(x > x* | ω₂), a hit; P(x > x* | ω₁), a false alarm; P(x < x* | ω₂), a miss; P(x < x* | ω₁), a correct rejection. If we plot the hit rate against the false-alarm rate as the criterion varies, we obtain a ROC (receiver operating characteristic) curve. With it we can distinguish between discriminability and bias.
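Sweeping the criterion x* and collecting (false-alarm, hit) pairs traces the ROC numerically. This sketch uses made-up values of u1, u2, and σ:

```python
import math

# ROC sweep for 1-D Gaussian noise (mean u1) and signal+noise (mean u2)
# with common sigma; d' = (u2 - u1) / sigma fixes the curve's shape.

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

u1, u2, sigma = 0.0, 2.0, 1.0   # illustration values, d' = 2
roc = []
for k in range(-40, 41):        # sweep criterion x* from -4 to 4
    x_star = k * 0.1
    p_fa = 1.0 - norm_cdf((x_star - u1) / sigma)   # P(x > x* | noise)
    p_hit = 1.0 - norm_cdf((x_star - u2) / sigma)  # P(x > x* | signal)
    roc.append((p_fa, p_hit))

# With u2 > u1 the hit rate dominates the false-alarm rate at every criterion,
# so the ROC bows above the chance diagonal:
assert all(h >= f for f, h in roc)
print(len(roc), "points on the ROC curve")
```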

155 ROC Curve (Receiver Operating Characteristic)
Hit PH=P(2|2) False Alarm PFA=P(2|1)

156 Neyman-Pearson Criterion
Hit PH=P(2|2) NP: max. PH subject to PFA ≦ a False Alarm PFA=P(2|1)

