1 Pattern Recognition Chapter 2 Bayesian Decision Theory

2 Bayesian decision theory is the fundamental statistical approach to the classification problem. It quantifies the tradeoffs between various classification decisions using probabilities and the costs associated with such decisions.
– Each action is associated with a cost or risk.
– The simplest risk is the classification error.
– Design classifiers to recommend actions that minimize some total expected risk.

3 Terminology (using the sea bass – salmon classification example)
State of nature ω (random variable):
– ω1 for sea bass, ω2 for salmon.
Prior probabilities P(ω1) and P(ω2) (priors):
– prior knowledge of how likely it is to get a sea bass or a salmon.
Probability density function p(x) (evidence):
– how frequently we will measure a pattern with feature value x (e.g., x is a lightness measurement).
Note: if x and y are different measurements, p(x) and p(y) correspond to different pdfs: pX(x) and pY(y).

4 Terminology (cont'd) (using the sea bass – salmon classification example)
Class-conditional probability density p(x/ωj) (likelihood):
– how frequently we will measure a pattern with feature value x given that the pattern belongs to class ωj (e.g., the lightness distributions of the salmon and sea bass populations).

5 Terminology (cont'd) (using the sea bass – salmon classification example)
Conditional probability P(ωj/x) (posterior):
– the probability that the fish belongs to class ωj given measurement x.
Note: we will use an uppercase P(.) to denote a probability mass function (pmf) and a lowercase p(.) to denote a probability density function (pdf).

6 Decision Rule Using Priors Only
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
P(error) = min[P(ω1), P(ω2)]
This rule favors the most likely class (it is optimal if no other information is available).
However, it makes the same decision every time, so it only makes sense for judging a single fish.

7 Decision Rule Using Conditional pdf
Using Bayes' rule, the posterior probability of category ωj given measurement x is given by:
P(ωj/x) = p(x/ωj) P(ωj) / p(x),
where p(x) = Σj p(x/ωj) P(ωj) (a scale factor that makes the posteriors sum to 1).
Decide ω1 if P(ω1/x) > P(ω2/x); otherwise decide ω2, or equivalently:
Decide ω1 if p(x/ω1) P(ω1) > p(x/ω2) P(ω2); otherwise decide ω2.
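Below is a minimal sketch of this rule (not from the slides): the 1-D Gaussian class-conditional densities for the lightness feature and the priors are made-up assumptions, used only to show how the posteriors and the decision are computed.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """1-D Gaussian density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical parameters for the lightness feature (not from the slides).
priors = {"sea_bass": 0.6, "salmon": 0.4}                 # P(w1), P(w2)
params = {"sea_bass": (4.0, 1.0), "salmon": (6.5, 1.2)}   # (mean, std) of p(x/wj)

def posteriors(x):
    """P(wj/x) = p(x/wj) P(wj) / p(x), with p(x) as the normalizing scale factor."""
    joint = {w: gauss_pdf(x, *params[w]) * priors[w] for w in priors}
    evidence = sum(joint.values())
    return {w: v / evidence for w, v in joint.items()}

x = 5.1                                       # a lightness measurement
post = posteriors(x)
print(post, "-> decide", max(post, key=post.get))   # pick the class with the larger posterior
```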

8 Decision Rule Using Conditional pdf (cont’d)

9 Probability of Error
The probability of error is defined as:
P(error/x) = P(ω1/x) if we decide ω2, or P(ω2/x) if we decide ω1.
The average probability of error is given by:
P(error) = ∫ P(error/x) p(x) dx
The Bayes rule is optimal, that is, it minimizes the average probability of error, since:
P(error/x) = min[P(ω1/x), P(ω2/x)]

10 Where Do Probabilities Come From?
The Bayesian rule is optimal if the pmf or pdf is known.
There are two competing answers to the above question:
(1) Relative frequency (objective) approach.
– Probabilities can only come from experiments.
(2) Bayesian (subjective) approach.
– Probabilities may reflect degrees of belief and can be based on opinion as well as experiments.

11 Example
Classify cars on the UNR campus as costing more or less than $50K:
– C1: price > $50K
– C2: price < $50K
– Feature x: height of the car
From Bayes' rule, we know how to compute the posterior probabilities:
P(Ci/x) = p(x/Ci) P(Ci) / p(x)
We need to estimate p(x/C1), p(x/C2), P(C1), and P(C2).

12 Example (cont'd)
Determine the prior probabilities:
– Collect data: ask drivers how much their car cost and measure its height.
– e.g., from 1209 samples: #C1 = 221, #C2 = 988, giving P(C1) = 221/1209 ≈ 0.18 and P(C2) = 988/1209 ≈ 0.82.

13 Example (cont'd)
Determine the class-conditional probabilities (likelihoods):
– Discretize the car height into bins and use a normalized histogram for each class.

14 Example (cont'd)
Calculate the posterior probability for each bin, e.g.:
P(C1/x) = p(x/C1) P(C1) / [p(x/C1) P(C1) + p(x/C2) P(C2)]
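The whole pipeline of slides 11–14 can be sketched as follows. Only the sample counts (221 and 988) come from the slides; the car heights are synthetic stand-ins, since the actual measurements are not given.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey data: 221 expensive cars (C1) and 988 cheaper
# cars (C2), with made-up height distributions.
h_c1 = rng.normal(1.55, 0.20, 221)   # heights of cars with price > $50K
h_c2 = rng.normal(1.45, 0.15, 988)   # heights of cars with price < $50K

P_C1 = len(h_c1) / (len(h_c1) + len(h_c2))   # priors estimated from the counts
P_C2 = 1.0 - P_C1

# Discretize height into bins and build normalized histograms (class-conditional pmfs).
bins = np.linspace(1.0, 2.2, 13)
p_x_C1, _ = np.histogram(h_c1, bins=bins)
p_x_C2, _ = np.histogram(h_c2, bins=bins)
p_x_C1 = p_x_C1 / p_x_C1.sum()
p_x_C2 = p_x_C2 / p_x_C2.sum()

# Posterior for each bin: P(C1/x) = p(x/C1)P(C1) / [p(x/C1)P(C1) + p(x/C2)P(C2)]
evidence = p_x_C1 * P_C1 + p_x_C2 * P_C2
post_C1 = np.divide(p_x_C1 * P_C1, evidence,
                    out=np.zeros_like(evidence), where=evidence > 0)

for left, right, p in zip(bins[:-1], bins[1:], post_C1):
    print(f"height in [{left:.2f}, {right:.2f}): P(C1/x) = {p:.2f}")
```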

15 A More General Theory
Use more than one feature.
Allow more than two categories.
Allow actions other than classifying the input into one of the possible categories (e.g., rejection).
Introduce a more general error function:
– a loss function (i.e., associate "costs" with actions).

16 Terminology
Features form a vector x ∈ R^d.
A finite set of c categories: ω1, ω2, …, ωc.
A finite set of l actions: α1, α2, …, αl.
A loss function λ(αi/ωj):
– the loss incurred for taking action αi when the classification category is ωj.
Bayes rule using vector notation:
P(ωj/x) = p(x/ωj) P(ωj) / p(x), where p(x) = Σj p(x/ωj) P(ωj).

17 Expected Loss (Conditional Risk)
The expected loss (or conditional risk) of taking action αi is:
R(αi/x) = Σj λ(αi/ωj) P(ωj/x)
The expected loss can be minimized by selecting the action that minimizes the conditional risk.
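A small sketch of the conditional-risk computation; the loss matrix and the posteriors below are hypothetical.

```python
import numpy as np

# Hypothetical loss matrix: loss[i, j] = lambda(a_i/w_j),
# the loss for taking action a_i when the true class is w_j.
loss = np.array([[0.0, 2.0],    # a1: decide w1
                 [1.0, 0.0]])   # a2: decide w2

def conditional_risk(posteriors):
    """R(a_i/x) = sum_j lambda(a_i/w_j) P(w_j/x)."""
    return loss @ posteriors

post = np.array([0.3, 0.7])          # example posteriors P(w1/x), P(w2/x)
R = conditional_risk(post)
best = int(np.argmin(R))             # pick the action with minimum conditional risk
print("R(a_i/x) =", R, "-> take action a%d" % (best + 1))
```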

18 Overall Risk
The overall risk R is:
R = ∫ R(α(x)/x) p(x) dx
where α(x) determines which action (α1, α2, …, αl) to take for every x (i.e., α(x) is a decision rule), and R(α(x)/x) is the conditional risk.
To minimize R, find a decision rule α(x) that chooses the action with the minimum conditional risk R(αi/x) for every x.
This rule yields optimal performance.

19 Bayes Decision Rule
The Bayes decision rule minimizes R by:
– computing R(αi/x) for every αi given an x, and
– choosing the action αi with the minimum R(αi/x).
The Bayes risk (i.e., the resulting minimum overall risk) is the best performance that can be achieved.

20 Example: Two-Category Classification
Two possible actions:
– α1 corresponds to deciding ω1
– α2 corresponds to deciding ω2
Notation: λij = λ(αi, ωj)
The conditional risks are:
R(α1/x) = λ11 P(ω1/x) + λ12 P(ω2/x)
R(α2/x) = λ21 P(ω1/x) + λ22 P(ω2/x)

21 Example: Two-Category Classification (cont'd)
Decision rule: decide ω1 if R(α1/x) < R(α2/x); otherwise decide ω2.
Equivalently: decide ω1 if (λ21 − λ11) p(x/ω1) P(ω1) > (λ12 − λ22) p(x/ω2) P(ω2); otherwise decide ω2.
Or, using the likelihood ratio: decide ω1 if p(x/ω1)/p(x/ω2) > [(λ12 − λ22)/(λ21 − λ11)] · [P(ω2)/P(ω1)]; otherwise decide ω2.
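A sketch of the likelihood-ratio form of this rule; the 1-D Gaussian densities, priors, and losses below are hypothetical choices for illustration.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical densities, priors, and losses (lam_ij = lambda(a_i/w_j)).
mu1, s1, mu2, s2 = 4.0, 1.0, 6.5, 1.2
P1, P2 = 0.6, 0.4
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0

def decide(x):
    """Decide w1 if the likelihood ratio exceeds the loss/prior-dependent threshold."""
    ratio = gauss_pdf(x, mu1, s1) / gauss_pdf(x, mu2, s2)       # p(x/w1) / p(x/w2)
    threshold = ((lam12 - lam22) / (lam21 - lam11)) * (P2 / P1)
    return "w1" if ratio > threshold else "w2"

for x in (3.0, 5.0, 7.0):
    print(x, "->", decide(x))
```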

22 Special Case: Zero-One Loss Function
It assigns no loss to correct decisions and the same (unit) loss to all errors:
λ(αi/ωj) = 0 if i = j, and 1 if i ≠ j.
The conditional risk corresponding to this loss function is:
R(αi/x) = Σ_{j≠i} P(ωj/x) = 1 − P(ωi/x)

23 Special Case: Zero-One Loss Function (cont'd)
The decision rule becomes: decide ωi if P(ωi/x) > P(ωj/x) for all j ≠ i,
or, for two categories: decide ω1 if p(x/ω1)/p(x/ω2) > P(ω2)/P(ω1); otherwise decide ω2.
What is the overall risk in this case?
(answer: the average probability of error)
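A quick numerical check (with made-up posteriors) that, under the zero-one loss, the minimum-risk action coincides with the maximum-posterior class.

```python
import numpy as np

post = np.array([0.2, 0.5, 0.3])          # example posteriors P(w_j/x) for c = 3 classes
zero_one = 1.0 - np.eye(3)                # lambda(a_i/w_j) = 0 if i == j else 1

risk = zero_one @ post                    # R(a_i/x) = 1 - P(w_i/x)
print(risk)                               # [0.8, 0.5, 0.7]
print(np.argmin(risk) == np.argmax(post)) # True: min risk <=> max posterior
```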

24 Example
θa was determined assuming the zero-one loss function.
θb was determined assuming a loss function that penalizes the two types of error unequally.
(figure: likelihood ratio and the resulting decision regions for the two thresholds)

25 Discriminant Functions
A general statistical classifier can be represented by a set of discriminant functions g1(x), …, gc(x):
Assign x to ωi if gi(x) > gj(x) for all j ≠ i (i.e., pick the maximum discriminant).

26 Discriminants for the Bayes Classifier
Using risks: gj(x) = −R(αj/x)
Using the zero-one loss function (i.e., minimum error rate): gj(x) = P(ωj/x)
Is the choice of gj unique?
– No: replacing gj(x) with f(gj(x)), where f(·) is monotonically increasing, does not change the classification results.

27 Decision Regions and Boundaries
Decision rules divide the feature space into decision regions R1, R2, …, Rc.
The boundaries of the decision regions are the decision boundaries.
g1(x) = g2(x) at the decision boundaries.

28 Case of Two Categories
It is more common to use a single discriminant function (a dichotomizer) instead of two:
g(x) = g1(x) − g2(x); decide ω1 if g(x) > 0, otherwise decide ω2.
Examples of dichotomizers:
g(x) = P(ω1/x) − P(ω2/x)
g(x) = ln [p(x/ω1)/p(x/ω2)] + ln [P(ω1)/P(ω2)]

29 Discriminant Functions for the Multivariate Gaussian Density
Assume the following discriminant function, with p(x/ωi) ~ N(μi, Σi):
gi(x) = ln p(x/ωi) + ln P(ωi)
      = −(1/2)(x − μi)^T Σi^{-1} (x − μi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
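A sketch of this discriminant for arbitrary means and covariances; the two-class, two-feature parameters below are made up for illustration.

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """g_i(x) = ln p(x/w_i) + ln P(w_i) for p(x/w_i) ~ N(mu_i, Sigma_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

# Hypothetical two-class problem.
classes = [
    {"mu": np.array([0.0, 0.0]), "sigma": np.array([[1.0, 0.3], [0.3, 1.0]]), "prior": 0.5},
    {"mu": np.array([2.0, 2.0]), "sigma": np.array([[1.5, 0.0], [0.0, 0.5]]), "prior": 0.5},
]

x = np.array([1.0, 1.2])
scores = [gaussian_discriminant(x, **c) for c in classes]
print("discriminants:", scores, "-> assign to class", int(np.argmax(scores)) + 1)
```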

30 Multivariate Gaussian Density: Case I
Assumption: Σi = σ²I
– The features are statistically independent.
– Each feature has the same variance.
The discriminant reduces to gi(x) = −‖x − μi‖²/(2σ²) + ln P(ωi); the ln P(ωi) term favors the a priori more likely category.

31 Multivariate Gaussian Density: Case I (cont'd)
The discriminant functions are linear: gi(x) = wi^T x + wi0, where wi = μi/σ² and wi0 = −μi^T μi/(2σ²) + ln P(ωi) (wi0 is the threshold or bias).
The decision boundary gi(x) = gj(x) is the hyperplane w^T (x − x0) = 0, with w = μi − μj and x0 = (1/2)(μi + μj) − [σ²/‖μi − μj‖²] ln[P(ωi)/P(ωj)] (μi − μj).

32 Multivariate Gaussian Density: Case I (cont'd)
Comments about this hyperplane:
– It passes through x0.
– It is orthogonal to the line linking the means.
– What happens when P(ωi) ≠ P(ωj)? Then x0 shifts away from the more likely mean.
– If σ is very small, the position of the boundary is relatively insensitive to P(ωi) and P(ωj).

33 Multivariate Gaussian Density: Case I (cont’d)

34 Multivariate Gaussian Density: Case I (cont'd)
Minimum distance classifier:
– When P(ωi) is the same for each of the c classes, the decision rule reduces to assigning x to the class with the nearest mean (in Euclidean distance).
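A minimal sketch of the minimum distance classifier, assuming equal priors and hypothetical class means.

```python
import numpy as np

# Hypothetical class means; equal priors and Sigma_i = sigma^2 I are assumed.
means = np.array([[0.0, 0.0],
                  [3.0, 1.0],
                  [1.0, 4.0]])

def minimum_distance_classify(x):
    """Assign x to the class with the nearest mean (Euclidean distance)."""
    d2 = np.sum((means - x) ** 2, axis=1)   # squared distances ||x - mu_i||^2
    return int(np.argmin(d2))

print(minimum_distance_classify(np.array([2.5, 0.5])))   # -> class index 1
```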

35 Multivariate Gaussian Density: Case II
Assumption: Σi = Σ (all classes share the same covariance matrix).
The discriminant functions are again linear: gi(x) = wi^T x + wi0, with wi = Σ^{-1} μi and wi0 = −(1/2) μi^T Σ^{-1} μi + ln P(ωi).

36 Multivariate Gaussian Density: Case II (cont'd)
Comments about this hyperplane:
– It passes through x0.
– It is NOT orthogonal to the line linking the means.
– What happens when P(ωi) ≠ P(ωj)? Then x0 shifts away from the more likely mean.

37 Multivariate Gaussian Density: Case II (cont’d)

38 Multivariate Gaussian Density: Case II (cont'd)
Mahalanobis distance classifier:
– When P(ωi) is the same for each of the c classes, the decision rule reduces to assigning x to the class whose mean is nearest in Mahalanobis distance, (x − μi)^T Σ^{-1} (x − μi).
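A corresponding sketch of the Mahalanobis distance classifier, assuming equal priors and a hypothetical shared covariance matrix.

```python
import numpy as np

# Hypothetical shared covariance and class means; equal priors are assumed.
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
means = np.array([[0.0, 0.0],
                  [3.0, 1.0]])
sigma_inv = np.linalg.inv(sigma)

def mahalanobis_classify(x):
    """Assign x to the class whose mean is nearest in Mahalanobis distance."""
    d2 = [(x - mu) @ sigma_inv @ (x - mu) for mu in means]
    return int(np.argmin(d2))

print(mahalanobis_classify(np.array([1.0, 0.8])))
```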

39 Multivariate Gaussian Density: Case III
Assumption: Σi = arbitrary
The discriminant functions are quadratic, so the decision boundaries are hyperquadrics: e.g., hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.

40 Multivariate Gaussian Density: Case III (cont'd)
(figure: disconnected decision regions when P(ω1) = P(ω2))

41 Multivariate Gaussian Density: Case III (cont'd)
(figure: disconnected decision regions and non-linear decision boundaries)

42 Multivariate Gaussian Density: Case III (cont'd)
More examples (Σi arbitrary).

43 Multivariate Gaussian Density: Case III (cont'd)
A four-category example.

44 Example: Case III
(decision boundary for P(ω1) = P(ω2); the boundary does not pass through the midpoint of μ1 and μ2)

45 Missing Features
Suppose x = (x1, x2) is a feature vector.
What can we do when x1 is missing during classification?
– Maybe use the mean value of all x1 measurements?
– But the class-conditional density at that mean value may be largest for the wrong class!

46 Missing Features (cont'd)
Suppose x = [xg, xb] (xg: good features, xb: bad/missing features).
Derive the Bayes rule using the good features by marginalizing over the missing ones:
P(ωi/xg) = p(ωi, xg) / p(xg) = ∫ p(ωi, xg, xb) dxb / ∫ p(xg, xb) dxb = ∫ P(ωi/xg, xb) p(xg, xb) dxb / ∫ p(xg, xb) dxb
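For discrete features, the marginalization above is just a sum over the missing variable. The joint table below is randomly generated (hypothetical), purely to illustrate the mechanics.

```python
import numpy as np

# Hypothetical discrete joint distribution P(w_i, x_g, x_b):
# axis 0 = class (2 values), axis 1 = good feature (3 values), axis 2 = bad feature (2 values).
rng = np.random.default_rng(1)
joint = rng.random((2, 3, 2))
joint /= joint.sum()

def posterior_missing_xb(xg):
    """P(w_i/x_g) obtained by summing (marginalizing) the joint over the missing x_b."""
    marg = joint[:, xg, :].sum(axis=1)     # sum_{x_b} P(w_i, x_g, x_b)
    return marg / marg.sum()               # divide by p(x_g)

print(posterior_missing_xb(xg=1))
```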

47 Noisy Features
Suppose x = [xg, xb] (xg: good features, xb: noisy features).
Suppose the noise is statistically independent and we know the noise model p(xb/xt):
– xb: observed feature values, xt: true feature values.
Assuming statistically independent noise (i.e., if xt were known, xb would be independent of xg and ωi), the posterior becomes:
P(ωi/xg, xb) = ∫ P(ωi/xg, xt) p(xg, xt) p(xb/xt) dxt / ∫ p(xg, xt) p(xb/xt) dxt

48 Noisy Features (cont'd)
(the independence assumption is what allows p(xb/xt) to be factored inside the integrals above)
What happens when p(xb/xt) is uniform? Then p(xb/xt) cancels from numerator and denominator, xb provides no information, and the rule reduces to the missing-feature case.

49 Receiver Operating Characteristic (ROC) Curve
Every classifier employs some kind of threshold value.
Changing the threshold affects the performance of the system.
ROC curves can help us distinguish between discriminability and decision bias (i.e., the choice of threshold).
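A sketch of how an ROC curve is traced by sweeping the threshold; the authentic and impostor score distributions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matching scores: authentic users score higher than impostors on average.
authentic = rng.normal(2.0, 1.0, 1000)
impostor = rng.normal(0.0, 1.0, 1000)

# Sweep the decision threshold and record the operating points.
thresholds = np.linspace(-4, 6, 101)
tpr = [(authentic > t).mean() for t in thresholds]   # correct acceptance rate
fpr = [(impostor > t).mean() for t in thresholds]    # false acceptance rate

for t, fp, tp in zip(thresholds[::25], fpr[::25], tpr[::25]):
    print(f"threshold {t:5.2f}: FPR = {fp:.3f}, TPR = {tp:.3f}")
```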

50 Example: Person Authentication
Authenticate a person using biometrics (e.g., a face image).
There are two possible distributions:
– authentic (A) and impostor (I)
(figure: the A and I distributions with the threshold separating the false positive, correct acceptance, correct rejection, and false rejection regions)

51 Example: Person Authentication (cont'd)
Possible cases:
– (1) correct acceptance (true positive): X belongs to A, and we decide A.
– (2) incorrect acceptance (false positive): X belongs to I, and we decide A.
– (3) correct rejection (true negative): X belongs to I, and we decide I.
– (4) incorrect rejection (false negative): X belongs to A, and we decide I.

52 Error vs. Threshold
(figure: error rates as a function of the decision threshold x*)

53 False Negatives vs Positives

54 Statistical Dependences Between Variables
Often the only knowledge we have about a distribution is which variables are or are not dependent.
Such dependencies can be represented graphically using a Bayesian Belief Network (or Belief Net).
In essence, Bayesian nets allow us to represent a joint probability density p(x, y, z, …) efficiently using dependency relationships.
p(x, y, z, …) can be either discrete or continuous.

55 Example of Dependencies
State of an automobile:
– engine temperature
– brake fluid pressure
– tire air pressure
– wire voltages
– etc.
Variables that are NOT causally related:
– engine oil pressure and tire air pressure
Variables that are causally related:
– coolant temperature and engine temperature

56 Definitions and Notation
A belief net is usually a Directed Acyclic Graph (DAG).
Each node represents one of the system variables.
Each variable can assume certain values (i.e., states), and each state is associated with a probability (discrete or continuous).

57 Relationships Between Nodes
A link joining two nodes is directional and represents a causal influence (e.g., X depends on A, or A influences X).
Influences can be direct or indirect (e.g., A influences X directly, and A influences C indirectly through X).

58 Parent/Children Nodes
Parent nodes P of X:
– the nodes before X (connected to X).
Children nodes C of X:
– the nodes after X (X is connected to them).

59 Conditional Probability Tables
Every node is associated with a set of weights that represent the prior/conditional probabilities (e.g., P(xi/aj), i = 1, 2, j = 1, 2, 3, 4).
For each conditioning case, the probabilities over the node's states sum to 1.

60 Learning
There exist algorithms for learning these probabilities from data.

61 Computing Joint Probabilities
We can compute the probability of any configuration of variables in the joint density distribution, e.g.:
P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2/a3, b1) P(c3/x2) P(d2/x2) = 0.25 × 0.6 × 0.4 × 0.5 × 0.4 = 0.012
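The same computation in code. The network structure (A and B are parents of X; C and D are children of X) and the five probabilities are taken from the slide's example; all other table entries are omitted, so the dictionaries below are deliberately sparse and hypothetical in layout.

```python
# Only the probabilities quoted on the slide are filled in.
P_A = {"a3": 0.25}
P_B = {"b1": 0.60}
P_X_given_AB = {("x2", "a3", "b1"): 0.40}
P_C_given_X = {("c3", "x2"): 0.50}
P_D_given_X = {("d2", "x2"): 0.40}

def joint(a, b, x, c, d):
    """P(a, b, x, c, d) = P(a) P(b) P(x/a,b) P(c/x) P(d/x)."""
    return (P_A[a] * P_B[b] * P_X_given_AB[(x, a, b)]
            * P_C_given_X[(c, x)] * P_D_given_X[(d, x)])

print(joint("a3", "b1", "x2", "c3", "d2"))   # 0.25 * 0.6 * 0.4 * 0.5 * 0.4 = 0.012
```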

62 Computing the Probability at a Node
E.g., determine the probability at D.

63 Computing the Probability at a Node (cont'd)
E.g., determine the probability at H.

64 Computing Probability Given Evidence (Bayesian Inference)
Determine the probability of some particular configuration of variables given the values of some other variables (the evidence),
e.g., compute P(b1/a2, x1, c1).

65 Computing Probability Given Evidence (Bayesian Inference) (cont'd)
In general, if X denotes the query variables and e denotes the evidence, then
P(X/e) = P(X, e) / P(e) = α P(X, e),
where α = 1/P(e) is a constant of proportionality.

66 An Example
Classify a fish given only the evidence that the fish is light (c1) and was caught in the south Atlantic (b2) -- no evidence about what time of year the fish was caught or its thickness.

67 An Example (cont’d)

68 An Example (cont’d)

69 An Example (cont'd)
Similarly, P(x2/c1,b2) = α · (the corresponding unnormalized value).
Normalize the probabilities (not strictly necessary):
P(x1/c1,b2) + P(x2/c1,b2) = 1 (α = 1/0.18)
P(x1/c1,b2) = 0.63
P(x2/c1,b2) = 0.27
Decision: salmon.

70 Another Example: Medical Diagnosis
Uppermost nodes: biological agents (bacteria, virus).
Intermediate nodes: diseases.
Lowermost nodes: symptoms.
Given some evidence (biological agents, symptoms), find the most likely disease.

71 Naïve Bayes' Rule
When the dependency relationships among the features are unknown, we can assume that the features are conditionally independent given the category:
P(a, b/x) = P(a/x) P(b/x)
Naïve Bayes rule: classify to the category x that maximizes P(x) multiplied by the product of the per-feature likelihoods P(featurek/x).
This is a simple assumption, but it usually works well in practice.
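A small sketch of the naïve Bayes rule for two discrete features; the conditional probability tables below are hypothetical.

```python
# Hypothetical tables for two discrete features (a, b) and two classes x1, x2.
P_class = {"x1": 0.6, "x2": 0.4}
P_a_given_class = {("a1", "x1"): 0.7, ("a2", "x1"): 0.3,
                   ("a1", "x2"): 0.2, ("a2", "x2"): 0.8}
P_b_given_class = {("b1", "x1"): 0.4, ("b2", "x1"): 0.6,
                   ("b1", "x2"): 0.9, ("b2", "x2"): 0.1}

def naive_bayes_posterior(a, b):
    """P(x/a,b) proportional to P(a/x) P(b/x) P(x), assuming conditional independence."""
    scores = {x: P_a_given_class[(a, x)] * P_b_given_class[(b, x)] * P_class[x]
              for x in P_class}
    total = sum(scores.values())
    return {x: s / total for x, s in scores.items()}

print(naive_bayes_posterior("a1", "b2"))   # -> x1 is by far the more probable class
```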