Handling Uncertainty

Uncertain knowledge

Typical example: diagnosis. Can we derive with certainty the diagnostic rule:
if Toothache=true then Cavity=true?

The problem is that this rule is not always right.
- Not all patients with a toothache have cavities; some of them have gum disease, an abscess, etc.

We could try turning the rule into a causal rule:
- if Cavity=true then Toothache=true

But this rule is not necessarily right either; not all cavities cause pain.

Name    Toothache  …  Cavity
Smith   true       …
Mike    true       …
Mary    false      …  true
Quincy  true       …  false
…       …          …  …

Belief and Probability

The connection between toothaches and cavities is not a logical consequence in either direction. However, we can attach a degree of belief to the rules. Our main tool for this is probability theory.

E.g., we might not know for sure what afflicts a particular patient, but we believe that there is, say, an 80% chance (that is, probability 0.8) that the patient has a cavity if he has a toothache.
- We usually get this belief from statistical data.

Syntax

Basic element: random variable, which corresponds to an “attribute” of the data.
e.g., Cavity (do I have a cavity?) is one of <cavity, ¬cavity>
      Weather is one of <sunny, rain, cloudy, snow>
Both Cavity and Weather are discrete random variables.
- Domain values must be exhaustive and mutually exclusive.

Elementary propositions are constructed by assigning a value to a random variable:
e.g., Cavity = ¬cavity, Weather = sunny

Prior probability and distribution

The prior or unconditional probability associated with a proposition is the degree of belief accorded to it in the absence of any other information.
e.g., P(Cavity = cavity) = 0.1 (or abbrev. P(cavity) = 0.1)
      P(Weather = sunny) = 0.7 (or abbrev. P(sunny) = 0.7)

A probability distribution gives values for all possible assignments:
P(Weather = sunny) = 0.7
P(Weather = rain) = 0.2
P(Weather = cloudy) = 0.08
P(Weather = snow) = 0.02

Conditional probability

E.g., P(cavity | toothache) = 0.8
i.e., the probability of cavity given that toothache is all I know.
It can be interpreted as the probability that the rule
if Toothache=true then Cavity=true
holds.

Definition of conditional probability:
P(a | b) = P(a ∧ b) / P(b)   if P(b) > 0

The product rule gives an alternative formulation:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)

Bayes' Rule

Product rule:
P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)

Bayes' rule:
P(a | b) = P(b | a) P(a) / P(b)

Useful for assessing a diagnostic probability from a causal probability:
P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)

Bayes' rule is useful in practice because there are many cases where we do have good probability estimates for three of these numbers and need to compute the fourth.

Applying Bayes’ rule

For example, a doctor knows that meningitis causes the patient to have a stiff neck 50% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff neck is 1/20.

So, what do we have in terms of probabilities?

Bayes’ rule (cont’d)

P(StiffNeck=true | Meningitis=true) = 0.5
P(Meningitis=true) = 1/50000
P(StiffNeck=true) = 1/20

P(Meningitis=true | StiffNeck=true)
= P(StiffNeck=true | Meningitis=true) P(Meningitis=true) / P(StiffNeck=true)
= (0.5) x (1/50000) / (1/20) = 0.0002

That is, we expect only 1 in 5000 patients with a stiff neck to have meningitis. This is still a very small chance. The reason is the very small a priori probability of meningitis.

Also, observe that
P(Meningitis=false | StiffNeck=true) = P(StiffNeck=true | Meningitis=false) P(Meningitis=false) / P(StiffNeck=true)
The factor 1/P(StiffNeck=true) is the same in both expressions. It is called the normalization constant (denoted by α).
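The meningitis arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: the function name `bayes_posterior` is invented for this example, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction


def bayes_posterior(likelihood, prior, evidence):
    """Return P(cause | effect) = P(effect | cause) * P(cause) / P(effect)."""
    return likelihood * prior / evidence


p_stiff_given_men = Fraction(1, 2)   # P(StiffNeck=true | Meningitis=true)
p_men = Fraction(1, 50000)           # P(Meningitis=true)
p_stiff = Fraction(1, 20)            # P(StiffNeck=true)

posterior = bayes_posterior(p_stiff_given_men, p_men, p_stiff)
print(posterior, float(posterior))
```

The result is the fraction 1/5000, i.e. 0.0002, matching the "1 in 5000" statement.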

Bayes’ rule (cont’d)

Well, we might say that doctors know that a stiff neck implies meningitis in 1 out of 5000 cases.
- That is, the doctor has quantitative information in the diagnostic direction, from symptoms (effects) to causes.
- Does such a doctor have no need for Bayes’ rule?

Unfortunately, diagnostic knowledge is more fragile than causal knowledge. Imagine there is a sudden epidemic of meningitis. The prior probability, P(Meningitis=true), will go up.
- The doctor who derived the diagnostic probability P(Meningitis=true | StiffNeck=true) from statistical observations of patients before the epidemic will have no idea how to update the value.
- The doctor who derives the diagnostic probability from the other three values will see that P(Meningitis=true | StiffNeck=true) goes up proportionally with P(Meningitis=true).

Clearly, P(StiffNeck=true | Meningitis=true) is unaffected by the epidemic. It simply reflects the way meningitis works.
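A small sketch of the epidemic scenario, under a stated assumption: the slides do not give P(StiffNeck=true | Meningitis=false), so the value below is back-solved so that P(StiffNeck=true) = 1/20 at the original prior, and it is assumed to stay fixed during the epidemic. The function name is hypothetical.

```python
from fractions import Fraction as F


def posterior_meningitis(prior, p_s_given_m, p_s_given_not_m):
    """P(Meningitis | StiffNeck), with P(StiffNeck) expanded by total probability."""
    evidence = p_s_given_m * prior + p_s_given_not_m * (1 - prior)
    return p_s_given_m * prior / evidence


p_s_given_m = F(1, 2)  # causal knowledge, unaffected by the epidemic
# Assumption (not on the slides): chosen so that P(StiffNeck) = 1/20
# when the prior is 1/50000.
p_s_given_not_m = F(4999, 99998)

before = posterior_meningitis(F(1, 50000), p_s_given_m, p_s_given_not_m)
during = posterior_meningitis(F(10, 50000), p_s_given_m, p_s_given_not_m)
print(before, float(during), float(during / before))
```

With a tenfold prior the posterior rises by a factor just under ten, because P(StiffNeck=true) itself grows slightly; for rare diseases the growth is very close to proportional.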

Bayes’ rule -- more vars

Although effect 1 might not be independent of effect 2, it might be that they are independent given the cause.

E.g.
effect 1 is ‘abilityInReading’
effect 2 is ‘lengthOfArms’

There is indeed a dependence between abilityInReading and lengthOfArms: people with longer arms read better than those with short arms. However, given the cause ‘Age’, abilityInReading is independent of lengthOfArms.

Conditional Independence – Naive Bayes

Two assumptions:
- Attributes are equally important
- Attributes are conditionally independent (given the class value)

The second assumption means that, if the class is known, knowledge about the value of a particular attribute doesn’t tell us anything about the value of another attribute.

Although based on assumptions that are almost never correct, this scheme works well in practice!

Weather Data

Here we don’t really have effects, but rather “evidence.”

Naïve Bayes for classification

Classification learning: what is the probability of the class given an instance?
- Evidence E = instance, or E_1, E_2, …, E_n
- Event H = class value for the instance

Naïve Bayes assumption: the evidence can be split into parts that are independent given the class (i.e., the attributes of the instance!)

P(H | E_1, E_2, …, E_n) = P(E_1 | H) P(E_2 | H) … P(E_n | H) P(H) / P(E_1, E_2, …, E_n)
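The formula follows in two steps, written out here for reference: first Bayes' rule applied to the whole evidence, then the conditional independence assumption applied to the likelihood.

```latex
\begin{align*}
P(H \mid E_1,\dots,E_n)
  &= \frac{P(E_1,\dots,E_n \mid H)\,P(H)}{P(E_1,\dots,E_n)}
  && \text{(Bayes' rule)} \\
  &= \frac{P(E_1 \mid H)\,P(E_2 \mid H)\cdots P(E_n \mid H)\,P(H)}{P(E_1,\dots,E_n)}
  && \text{(conditional independence given } H\text{)}
\end{align*}
```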

The weather data example

P(play=yes | E)
= P(Outlook=Sunny | play=yes) * P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E)
= (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E)
= 0.0053 / P(E)

Don’t worry about the 1/P(E); it is alpha, the normalization constant.

The weather data example

P(play=no | E)
= P(Outlook=Sunny | play=no) * P(Temp=Cool | play=no) * P(Humidity=High | play=no) * P(Windy=True | play=no) * P(play=no) / P(E)
= (3/5) * (1/5) * (4/5) * (3/5) * (5/14) / P(E)
= 0.0206 / P(E)

Normalization constant

P(play=yes | E) + P(play=no | E) = 1
i.e. 0.0053 / P(E) + 0.0206 / P(E) = 1
i.e. P(E) = 0.0053 + 0.0206

So,
P(play=yes | E) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(play=no | E) = 0.0206 / (0.0053 + 0.0206) = 79.5%
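The normalization step can be sketched in Python as follows. The factors are the conditional probabilities and priors from the weather example (Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True); the helper name `normalize` and the dictionary layout are choices made for this illustration only.

```python
from fractions import Fraction as F
from math import prod


def normalize(scores):
    """Divide each unnormalized score by their sum, i.e. multiply by alpha = 1/P(E)."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


likelihoods = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}
priors = {"yes": F(9, 14), "no": F(5, 14)}

# Unnormalized scores: prior times the product of per-attribute likelihoods.
scores = {c: priors[c] * prod(likelihoods[c]) for c in priors}
posterior = normalize(scores)
print({c: round(float(p), 3) for c, p in posterior.items()})
```

Because P(E) is the same for both classes, it never has to be estimated separately; it is recovered as the sum of the two scores.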

The “zero-frequency problem”

What if an attribute value doesn’t occur with every class value (e.g., suppose “Humidity = High” never occurred for class “Play=Yes”)?
- The probability P(Humidity=High | play=yes) will be zero!
- P(play=yes | E) will also be zero, no matter how likely the other values are!

Remedy:
- Add 1 to the count for every attribute value-class combination (Laplace estimator);
- Add k (the number of possible values of the attribute) to the denominator (see the example below).

Without the correction:
P(play=yes | E)
= P(Outlook=Sunny | play=yes) * P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E)
= (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E)
= 0.0053 / P(E)

With the Laplace estimator it will be instead:
= ((2+1)/(9+3)) * ((3+1)/(9+3)) * ((3+1)/(9+2)) * ((3+1)/(9+2)) * (9/14) / P(E)
= 0.0071 / P(E)

Here the 3 added in the first denominator is the number of possible values for ‘Outlook’, and the 2 added in the last denominator is the number of possible values for ‘Windy’.
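A minimal sketch of the Laplace estimator described above. The function name `laplace` is made up for this example; the counts are those of class play=yes (9 instances) in the weather example.

```python
from fractions import Fraction as F


def laplace(count, class_total, n_values):
    """Add-one estimate: (count + 1) / (class_total + number of attribute values)."""
    return F(count + 1, class_total + n_values)


# Smoothed conditional probabilities for class play=yes.
smoothed = [
    laplace(2, 9, 3),  # Outlook=Sunny (Outlook has 3 possible values)
    laplace(3, 9, 3),  # Temp=Cool (3 possible values)
    laplace(3, 9, 2),  # Humidity=High (2 possible values)
    laplace(3, 9, 2),  # Windy=True (2 possible values)
]

score_yes = F(9, 14)  # prior P(play=yes), left unsmoothed as on the slide
for p in smoothed:
    score_yes *= p

print(float(score_yes))   # unnormalized score, still to be divided by P(E)
print(laplace(0, 9, 2))   # a count of zero no longer yields probability zero
```

A combination that was never observed now gets the small but nonzero estimate 1/11 instead of wiping out the whole product.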

Missing values

Training: the instance is not included in the frequency count for the attribute value-class combination.
Classification: the attribute is omitted from the calculation.

Example (Outlook is missing):
P(play=yes | E)
= P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E)
= (3/9) * (3/9) * (3/9) * (9/14) / P(E)
= 0.0238 / P(E)

P(play=no | E)
= P(Temp=Cool | play=no) * P(Humidity=High | play=no) * P(Windy=True | play=no) * P(play=no) / P(E)
= (1/5) * (4/5) * (3/5) * (5/14) / P(E)
= 0.0343 / P(E)

After normalization: P(play=yes | E) = 41%, P(play=no | E) = 59%
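The classification-time rule (skip the factor for a missing attribute) can be sketched like this. The table layout, the use of `None` to mark a missing value, and the function name `posterior` are all assumptions of this illustration; the probabilities are the ones used in the weather example.

```python
from fractions import Fraction as F

PRIOR = {"yes": F(9, 14), "no": F(5, 14)}
COND = {
    "yes": {("Outlook", "Sunny"): F(2, 9), ("Temp", "Cool"): F(3, 9),
            ("Humidity", "High"): F(3, 9), ("Windy", "True"): F(3, 9)},
    "no": {("Outlook", "Sunny"): F(3, 5), ("Temp", "Cool"): F(1, 5),
           ("Humidity", "High"): F(4, 5), ("Windy", "True"): F(3, 5)},
}


def posterior(instance):
    """Naive Bayes posterior; attributes whose value is None are left out."""
    scores = {}
    for cls, prior in PRIOR.items():
        score = prior
        for attr, value in instance.items():
            if value is None:  # missing value: omit this factor
                continue
            score *= COND[cls][(attr, value)]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


day = {"Outlook": None, "Temp": "Cool", "Humidity": "High", "Windy": "True"}
result = posterior(day)
print({c: round(float(p), 2) for c, p in result.items()})
```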

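Dealing with numeric attributes

Usual assumption: attributes have a normal (Gaussian) probability distribution, given the class.

The probability density function for the normal distribution is:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

We approximate $\mu$ by the sample mean:
$$\mu \approx \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

We approximate $\sigma^2$ by the sample variance:
$$\sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$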
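As a rough sketch, the density and the two estimates can be written with the Python standard library. `gaussian_pdf` is a name chosen here, and the three-element sample is made up purely to show the calls; `statistics.variance` is the sample variance with the n-1 denominator.

```python
import math
import statistics


def gaussian_pdf(x, mean, var):
    """Normal density with the given mean and variance, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


sample = [4.0, 6.0, 8.0]            # hypothetical values of one attribute in one class
mean = statistics.mean(sample)      # sample mean
var = statistics.variance(sample)   # sample variance (divides by n - 1)

print(mean, var, gaussian_pdf(5.0, mean, var))
```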

Weather Data

f(Temperature=66 | yes) = e^(-(66-m)^2 / (2*var)) / sqrt(2*3.14*var)

m = (83 + 70 + 68 + 64 + 69 + 75 + 75 + 72 + 81) / 9 = 73

var = ( (83-73)^2 + (70-73)^2 + (68-73)^2 + (64-73)^2 + (69-73)^2 + (75-73)^2 + (75-73)^2 + (72-73)^2 + (81-73)^2 ) / (9-1) = 38

f(Temperature=66 | yes) = e^(-(66-73)^2 / (2*38)) / sqrt(2*3.14*38) = .034

We compute f(Temperature=66 | no) similarly.

Weather Data

f(Humidity=90 | yes) = e^(-(90-m)^2 / (2*var)) / sqrt(2*3.14*var)

m = (86 + 96 + 80 + 65 + 70 + 80 + 70 + 90 + 75) / 9 = 79

var = ( (86-79)^2 + (96-79)^2 + (80-79)^2 + (65-79)^2 + (70-79)^2 + (80-79)^2 + (70-79)^2 + (90-79)^2 + (75-79)^2 ) / (9-1) = 104

f(Humidity=90 | yes) = e^(-(90-79)^2 / (2*104)) / sqrt(2*3.14*104) = .022

We compute f(Humidity=90 | no) similarly.
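The two density values can be reproduced from the nine play=yes readings that appear in the variance sums. This is a sketch only; `gaussian_pdf` is a helper name chosen here, and it uses the unrounded mean and variance, so intermediate values differ slightly from the rounded ones on the slides.

```python
import math
import statistics


def gaussian_pdf(x, mean, var):
    """Normal density with the given mean and variance, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Readings for the nine play=yes days, taken from the variance sums on the slides.
temperature_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
humidity_yes = [86, 96, 80, 65, 70, 80, 70, 90, 75]

f_temp = gaussian_pdf(66, statistics.mean(temperature_yes),
                      statistics.variance(temperature_yes))
f_hum = gaussian_pdf(90, statistics.mean(humidity_yes),
                     statistics.variance(humidity_yes))

print(round(f_temp, 3), round(f_hum, 3))
```

Rounded to three decimals, the results agree with the .034 and .022 derived by hand.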

Classifying a new day

A new day E: Outlook=Sunny, Temp=66, Humidity=90, Windy=True.

P(play=yes | E)
= P(Outlook=Sunny | play=yes) * f(Temp=66 | play=yes) * f(Humidity=90 | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E)
= (2/9) * (0.034) * (0.022) * (3/9) * (9/14) / P(E)
= 0.000036 / P(E)

P(play=no | E)
= P(Outlook=Sunny | play=no) * f(Temp=66 | play=no) * f(Humidity=90 | play=no) * P(Windy=True | play=no) * P(play=no) / P(E)
= (3/5) * (0.0291) * (0.038) * (3/5) * (5/14) / P(E)
= 0.000142 / P(E)

After normalization: P(play=yes | E) = 20.0%, P(play=no | E) = 80.0%
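For numeric attributes the density values simply take the place of the frequency-based probabilities in the product. A quick sketch using the factors exactly as listed for the new day; multiplying them out and normalizing gives roughly 20% versus 80%.

```python
# Factors as listed on the slide: Outlook, Temp density, Humidity density, Windy, prior.
score_yes = (2 / 9) * 0.034 * 0.022 * (3 / 9) * (9 / 14)
score_no = (3 / 5) * 0.0291 * 0.038 * (3 / 5) * (5 / 14)

alpha = 1 / (score_yes + score_no)  # normalization constant, 1 / P(E)
p_yes = alpha * score_yes
p_no = alpha * score_no

print(f"{score_yes:.6f} {score_no:.6f}")
print(f"{p_yes:.1%} {p_no:.1%}")
```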

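Probability densities

Relationship between probability and density:
$$\Pr\left[c - \tfrac{\varepsilon}{2} < x < c + \tfrac{\varepsilon}{2}\right] \approx \varepsilon \cdot f(c)$$

But: this doesn’t change the calculation of a posteriori probabilities, because $\varepsilon$ cancels out.

Exact relationship:
$$\Pr[a \le x \le b] = \int_a^b f(t)\,dt$$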
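To see why the interval width drops out, write $\varepsilon$ for the width of a small interval around the observed value $x$ (the symbol is chosen here for illustration) and consider a single numeric attribute with two classes. The same factor appears in the numerator and in every term of the denominator:

```latex
\[
P(\text{yes} \mid x)
  = \frac{\varepsilon\, f(x \mid \text{yes})\,P(\text{yes})}
         {\varepsilon\, f(x \mid \text{yes})\,P(\text{yes}) + \varepsilon\, f(x \mid \text{no})\,P(\text{no})}
  = \frac{f(x \mid \text{yes})\,P(\text{yes})}
         {f(x \mid \text{yes})\,P(\text{yes}) + f(x \mid \text{no})\,P(\text{no})}
\]
```

This is why density values can be plugged directly into the naïve Bayes product even though they are not probabilities.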

Discussion of Naïve Bayes

Naïve Bayes works surprisingly well, even if the independence assumption is clearly violated. The reason is that classification doesn’t require accurate probability estimates, as long as the maximum probability is assigned to the correct class.

Tax Data – Naive Bayes

Classify: (_, No, Married, 95K, ?)
(Also apply the Laplace normalization.)

Tax Data – Naive Bayes

Classify: (_, No, Married, 95K, ?)

P(Yes) = 3/10 = 0.3
P(Refund=No | Yes) = (3+1)/(3+2) = 0.8
P(Status=Married | Yes) = (0+1)/(3+3) = 0.17

Approximate μ with: (95 + 85 + 90)/3 = 90
Approximate σ^2 with: ( (95-90)^2 + (85-90)^2 + (90-90)^2 ) / (3-1) = 25

f(income=95 | Yes) = e^(-(95-90)^2 / (2*25)) / sqrt(2*3.14*25) = .048

P(Yes | E) = α * .8 * .17 * .048 * .3 = α * .00196

Tax Data

Classify: (_, No, Married, 95K, ?)

P(No) = 7/10 = .7
P(Refund=No | No) = (4+1)/(7+2) = .556
P(Status=Married | No) = (4+1)/(7+3) = .5

Approximate μ with: (125 + 100 + 70 + 120 + 60 + 220 + 75)/7 = 110
Approximate σ^2 with: ( (125-110)^2 + (100-110)^2 + (70-110)^2 + (120-110)^2 + (60-110)^2 + (220-110)^2 + (75-110)^2 ) / (7-1) = 2975

f(income=95 | No) = e^(-(95-110)^2 / (2*2975)) / sqrt(2*3.14*2975) = .00704

P(No | E) = α * .556 * .5 * .00704 * .7 = α * .00137

Tax Data

Classify: (_, No, Married, 95K, ?)

P(Yes | E) = α * .00196
P(No | E) = α * .00137

α = 1/(.00196 + .00137) = 300.3

P(Yes | E) = 300.3 * .00196 = 0.59
P(No | E) = 300.3 * .00137 = 0.41

We predict “Yes.”
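The whole tax example combines the Laplace estimator for the categorical attributes with a Gaussian density for income. A compact sketch follows; the helper names are invented for this illustration, the class-Yes incomes are the three values shown in the variance sum, and for class No the stated mean (110) and variance (2975) are used directly rather than recomputed.

```python
import math
import statistics


def gaussian_pdf(x, mean, var):
    """Normal density with the given mean and variance, evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def laplace(count, class_total, n_values):
    """Add-one estimate: (count + 1) / (class_total + number of attribute values)."""
    return (count + 1) / (class_total + n_values)


# Class Yes: 3 of 10 records; all 3 have Refund=No, none is Married.
yes_incomes = [95, 85, 90]
f_yes = gaussian_pdf(95, statistics.mean(yes_incomes), statistics.variance(yes_incomes))
score_yes = 0.3 * laplace(3, 3, 2) * laplace(0, 3, 3) * f_yes

# Class No: 7 of 10 records; 4 have Refund=No, 4 are Married.
f_no = gaussian_pdf(95, 110, 2975)  # mean and variance as stated on the slide
score_no = 0.7 * laplace(4, 7, 2) * laplace(4, 7, 3) * f_no

alpha = 1 / (score_yes + score_no)
p_yes, p_no = alpha * score_yes, alpha * score_no
print(round(p_yes, 2), round(p_no, 2))
```

The unrounded computation lands on the same 0.59 versus 0.41 split as the hand calculation.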

Text Categorization

Text categorization is the task of assigning a given document to one of a fixed set of categories, on the basis of the text it contains. Naïve Bayes models are often used for this task. In these models, the query variable is the document category, and the effect variables are the presence or absence of each word in the language.

How can such a model be constructed, given as training data a set of documents that have been assigned to categories?

The model consists of the prior probability P(Category) and the conditional probabilities P(Word_i | Category).
- For each category c, P(Category = c) is estimated as the fraction of all the training documents that are of that category.
- Similarly, P(Word_i = true | Category = c) is estimated as the fraction of documents of category c that contain word i.
- Also, P(Word_i = true | Category = ¬c) is estimated as the fraction of documents not of category c that contain word i.

Text Categorization (cont’d)

Now we can use naïve Bayes for classifying a new document:

P(Category = c | Word_1 = true, …, Word_n = true) = α * P(Category = c) * ∏_{i=1..n} P(Word_i = true | Category = c)

P(Category = ¬c | Word_1 = true, …, Word_n = true) = α * P(Category = ¬c) * ∏_{i=1..n} P(Word_i = true | Category = ¬c)

Word_1, …, Word_n are the words occurring in the new document, and α is the normalization constant.

Observe that, as with missing values, the new document doesn’t contain every word for which we computed probabilities; the words that do not occur are simply left out of the product.
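A toy sketch of this procedure is given below. Everything in it is illustrative: the four training documents and the function names are made up, it handles several categories at once rather than c versus ¬c, and it adds Laplace smoothing (borrowed from the zero-frequency remedy, not stated on these slides) so that a word never seen in a category does not zero out the product.

```python
from collections import Counter, defaultdict


def train(docs):
    """docs: list of (category, text). Count documents per category and,
    per category, the number of documents containing each word."""
    n_docs = Counter()
    doc_freq = defaultdict(Counter)
    for category, text in docs:
        n_docs[category] += 1
        for word in set(text.lower().split()):
            doc_freq[category][word] += 1
    return n_docs, doc_freq


def classify(text, n_docs, doc_freq):
    """Posterior over categories using only the words present in the document."""
    total_docs = sum(n_docs.values())
    scores = {}
    for category, count in n_docs.items():
        score = count / total_docs  # prior P(Category = c)
        for word in set(text.lower().split()):
            # Smoothed P(Word = true | Category = c); 2 values: present or absent.
            score *= (doc_freq[category][word] + 1) / (count + 2)
        scores[category] = score
    alpha = 1 / sum(scores.values())  # normalization constant
    return {category: alpha * s for category, s in scores.items()}


training = [  # hypothetical training documents
    ("sports", "the team won the match"),
    ("sports", "a great match and a great goal"),
    ("politics", "the vote was close"),
    ("politics", "parliament passed the vote"),
]
n_docs, doc_freq = train(training)
result = classify("the match", n_docs, doc_freq)
print(max(result, key=result.get), result)
```

For long documents one would normally sum log-probabilities instead of multiplying, to avoid numerical underflow; the product form is kept here because it mirrors the formula above.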