Data Mining: Concepts and Techniques
February 5, 2018

Overview
- Bayesian Probability
- Bayes’ Rule
- Naïve Bayesian Classification

Probability
- Let P(A) represent the probability that proposition A is true.
- Example: Let Risky represent that a customer is a high credit risk. P(Risky) = 0.519 means that there is a 51.9% chance that a given customer is a high credit risk.
- Without any other information, this probability is called the prior or unconditional probability.
- Propositions are logical statements without quantifiers ("for all" or "there exists") that are either true or false. For the purposes of this course, we will not consider probabilities for first-order logic statements.

Random Variables
- We could also consider a random variable X, which can take on one of many values in its domain <x1, x2, …, xn>.
- Example: Let Weather be a random variable with domain <sunny, rain, cloudy, snow>. The probabilities of Weather taking on each of these values are:
  P(Weather=sunny) = 0.7
  P(Weather=rain) = 0.2
  P(Weather=cloudy) = 0.08
  P(Weather=snow) = 0.02

Conditional Probability
- Probabilities of events change when we know something about the world.
- The notation P(A|B) is used to represent the conditional or posterior probability of A.
- Read P(A|B) as “the probability of A given that all we know is B.”
- Example: P(Weather = snow | Temperature = below freezing) = 0.10
- The example above is read “there is a 10% chance of snow given that we know the temperature is below freezing.”

Axioms of Probability
- All probabilities are between 0 and 1:
  0 ≤ P(A) ≤ 1
- Necessarily true propositions have probability 1; necessarily false propositions have probability 0:
  P(true) = 1
  P(false) = 0
- The probability of a disjunction is given by
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Some examples of using the axioms are done on the board, for example showing that P(A | B ∧ A) = 1.

Axioms of Probability (continued)
- We can use logical connectives inside probabilities:
  P(Weather = snow ∧ Temperature = below freezing)
- Disjunction (or) and negation (not) can be used as well.
- The product rule:
  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
- Note that conjunctions are not the same as conditional probabilities; the product rule above gives the relationship between them. The conditional probability notation allows for more concise statements.

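The board example mentioned with the axioms, P(A | B ∧ A) = 1, can be worked out from the product rule alone. The derivation below is one possible route, assuming P(A ∧ B) > 0 so that the conditional probability is defined:

```latex
\begin{align*}
P(A \mid B \land A)
  &= \frac{P(A \land (B \land A))}{P(B \land A)}
     && \text{(product rule)} \\
  &= \frac{P(A \land B)}{P(A \land B)}
     && \text{(since } A \land A \equiv A \text{)} \\
  &= 1 .
\end{align*}
```

Intuitively, once A is part of what is known, A is certain.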
Bayes Theorem - 1
[Figure: Venn diagram of events A and B inside a rectangle, with regions labeled P(A), P(B), and P(A ∩ B)]
- Consider the Venn diagram. The area of the rectangle is 1, and the area of each region gives the probability of the event(s) associated with that region.
- P(A|B) means “the probability of observing event A given that event B has already been observed”, i.e. how much of the time that we see B do we also see A? (This is the ratio of the purple region, A ∩ B, to the magenta region, B.)
- P(A|B) = P(A ∩ B)/P(B), and also P(B|A) = P(A ∩ B)/P(A); therefore
  P(A|B) = P(B|A) P(A) / P(B)   (Bayes formula for two events)

Bayes Theorem - 2
More formally:
- Let X be the sample data (evidence).
- Let H be a hypothesis that X belongs to class C.
- In classification problems we wish to determine the probability that H holds given the observed sample data X.
- That is, we seek P(H|X), which is known as the posterior probability of H conditioned on X.

Bayes Theorem - 3
- P(H) is the prior probability of H.
- Similarly, P(X|H) is the posterior probability of X conditioned on H (also called the likelihood of H).
- Bayes Theorem (from the earlier slide) is then

$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$

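As a minimal sketch of Bayes' theorem in code, the snippet below inverts the snow example from the earlier slides: given P(snow | below freezing) = 0.10 and P(snow) = 0.02, it estimates P(below freezing | snow). The function name `bayes_posterior` and the value P(below freezing) = 0.15 are assumptions made for illustration; neither appears in the slides.

```python
def bayes_posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Return P(H|X) = P(X|H) * P(H) / P(X)."""
    if not 0.0 < evidence <= 1.0:
        raise ValueError("P(X) must be in (0, 1]")
    return likelihood * prior / evidence


# H: Temperature = below freezing, X: Weather = snow
p_snow_given_freezing = 0.10  # P(X|H), from the conditional probability slide
p_snow = 0.02                 # P(X), from the random variables slide
p_freezing = 0.15             # P(H), ASSUMED for illustration only

p_freezing_given_snow = bayes_posterior(p_snow_given_freezing, p_freezing, p_snow)
print(round(p_freezing_given_snow, 3))
```

With these numbers the posterior is much larger than the prior: observing snow makes below-freezing temperatures far more plausible.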
Chapter 7. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary

Bayesian Classification: Why?
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
- Foundation: based on Bayes’ Theorem.
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem
- Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:

$$P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

- MAP (maximum a posteriori) hypothesis:

$$h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\,P(h)$$

- For classification: predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.
- Practical difficulty: this requires initial knowledge of many probabilities and has significant computational cost.

Towards Naïve Bayesian Classifier
- Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn).
- Suppose there are m classes C1, C2, …, Cm.
- Classification is to derive the maximum a posteriori class, i.e., the class with maximal P(Ci|X).
- This can be derived from Bayes’ theorem:

$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$$

- Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
- Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci).

Derivation of Naïve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):

$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$$

- This greatly reduces the computation cost: only the class distribution needs to be counted.
- If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D).
- If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

  and P(xk|Ci) is

$$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$

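The two ingredients of the derivation, the Gaussian density for a continuous attribute and the product over attributes, can be sketched as follows. The helper names `gaussian_pdf` and `class_conditional`, and the class statistics (mean 38, standard deviation 12) are hypothetical values chosen only to make the sketch runnable.

```python
import math


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Gaussian density g(x, mu, sigma), used as P(x_k | C_i) for a continuous attribute."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def class_conditional(terms) -> float:
    """P(X | C_i) under the naive assumption: product of the per-attribute terms."""
    return math.prod(terms)


# HYPOTHETICAL per-class statistics for one continuous attribute (e.g. age)
mu_age, sigma_age = 38.0, 12.0
p_age = gaussian_pdf(35.0, mu_age, sigma_age)

# Combine the continuous term with one categorical term (a count ratio such as 4/9)
p_x_given_c = class_conditional([p_age, 4 / 9])
print(p_age, p_x_given_c)
```

Note that the Gaussian term is a density rather than a probability, so it can exceed 1 for small σ; this does not matter when only the ranking of classes is needed.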
Bayesian classification
- The classification problem may be formalized using a posteriori probabilities:
  P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C.
- E.g. P(class=N | outlook=sunny, windy=true, …)
- Idea: assign to sample X the class label C such that P(C|X) is maximal.

Play-tennis example: estimating P(xi|C)

Class priors:
  P(p) = 9/14             P(n) = 5/14

outlook
  P(sunny|p) = 2/9        P(sunny|n) = 3/5
  P(overcast|p) = 4/9     P(overcast|n) = 0
  P(rain|p) = 3/9         P(rain|n) = 2/5
temperature
  P(hot|p) = 2/9          P(hot|n) = 2/5
  P(mild|p) = 4/9         P(mild|n) = 2/5
  P(cool|p) = 3/9         P(cool|n) = 1/5
humidity
  P(high|p) = 3/9         P(high|n) = 4/5
  P(normal|p) = 6/9       P(normal|n) = 1/5
windy
  P(true|p) = 3/9         P(true|n) = 3/5
  P(false|p) = 6/9        P(false|n) = 2/5

Play-tennis example: classifying X
- An unseen sample X = <rain, hot, high, false>
- P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
- P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
- Since 0.018286 > 0.010582, sample X is classified in class n (don’t play).

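The classification of X = <rain, hot, high, false> can be reproduced with exact fractions. This is a minimal sketch: the dictionary layout and variable names are illustrative choices, while the probabilities are copied from the play-tennis estimates.

```python
from fractions import Fraction as F
from math import prod

# P(value | class) for the attribute values that occur in X
cond = {
    "p": {"rain": F(3, 9), "hot": F(2, 9), "high": F(3, 9), "false": F(6, 9)},
    "n": {"rain": F(2, 5), "hot": F(2, 5), "high": F(4, 5), "false": F(2, 5)},
}
prior = {"p": F(9, 14), "n": F(5, 14)}

x = ["rain", "hot", "high", "false"]

# Unnormalized score P(X|C) * P(C) for each class
score = {c: prod(cond[c][v] for v in x) * prior[c] for c in prior}
label = max(score, key=score.get)

for c, s in score.items():
    print(c, round(float(s), 6))
print("predicted class:", label)
```

Using `Fraction` avoids rounding during the multiplication, so the only rounding happens when the scores are printed.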
Numeric Attributes

Training dataset
- Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’
- Data sample:
  X = (age<=30, income=medium, student=yes, credit_rating=fair)

Naïve Bayesian Classifier: Example
X = (age<=30, income=medium, student=yes, credit_rating=fair)

Compute P(X|Ci) for each class:
  P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
  P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
  P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
  P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
  P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
  P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
  P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
  P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4

P(X|Ci):
  P(X | buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci):
  P(X | buys_computer=“yes”) * P(buys_computer=“yes”) = 0.028
  P(X | buys_computer=“no”) * P(buys_computer=“no”) = 0.007

Since 0.028 > 0.007, X belongs to class “buys_computer=yes”.

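The scores 0.028 and 0.007 are unnormalized because P(X) was dropped. The sketch below divides by their sum to recover posterior probabilities under the naïve model. The priors 9/14 and 5/14 are an assumption inferred from the denominators 9 and 5 in the conditional probabilities; the slide does not state them explicitly.

```python
from fractions import Fraction as F

# Unnormalized scores P(X|Ci) * P(Ci), built from the counts on the slide.
# ASSUMPTION: priors are 9/14 ("yes") and 5/14 ("no"), inferred from the denominators.
score_yes = F(2, 9) * F(4, 9) * F(6, 9) * F(6, 9) * F(9, 14)
score_no = F(3, 5) * F(2, 5) * F(1, 5) * F(2, 5) * F(5, 14)

# Under the naive model, P(X) is the sum of the scores over all classes
evidence = score_yes + score_no

post_yes = score_yes / evidence
post_no = score_no / evidence
print(round(float(post_yes), 3), round(float(post_no), 3))
```

Dividing both scores by the same constant cannot change which one is larger, which is why the denominator can be skipped when only the predicted class is needed.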
Example 3
Take the following training data, from bank loan applicants:
[Table: four applicants (ApplicantID 1 to 4) with attributes City (Philly), Children (Few, Many), Income (Low, Medium, High), and Status (DEFAULTS, PAYS)]

As our attributes are all categorical in this case, we obtain our probabilities using simple counts and ratios:
  P[City=Philly | Status = DEFAULTS] = 2/2 = 1
  P[City=Philly | Status = PAYS] = 2/2 = 1
  P[Children=Many | Status = DEFAULTS] = 2/2 = 1
  P[Children=Few | Status = DEFAULTS] = 0/2 = 0
  etc.

Example 3
Summarizing, we have the following probabilities:

  Probability of...    ... given DEFAULTS    ... given PAYS
  City=Philly          2/2 = 1               2/2 = 1
  Children=Few         0/2 = 0               2/2 = 1
  Children=Many        2/2 = 1               0/2 = 0
  Income=Low           1/2 = 0.5             0/2 = 0
  Income=Medium        1/2 = 0.5             1/2 = 0.5
  Income=High          0/2 = 0               1/2 = 0.5

and
  P[Status = DEFAULTS] = 2/4 = 0.5
  P[Status = PAYS] = 2/4 = 0.5

For example, the probability of Income=Medium given that the applicant DEFAULTS is the number of applicants with Income=Medium who DEFAULT divided by the number of applicants who DEFAULT = 1/2 = 0.5.

Example 3
Now, assume a new example is presented where City=Philly, Children=Many, and Income=Medium.

First, we estimate the likelihood that the example is a defaulter, given its attribute values:
  P[H1|E] = P[E|H1] · P[H1]   (denominator omitted*)
  P[Status = DEFAULTS | Philly, Many, Medium]
    = P[Philly|DEFAULTS] x P[Many|DEFAULTS] x P[Medium|DEFAULTS] x P[DEFAULTS]
    = 1 x 1 x 0.5 x 0.5 = 0.25

Then we estimate the likelihood that the example is a payer, given its attributes:
  P[H2|E] = P[E|H2] · P[H2]   (denominator omitted*)
  P[Status = PAYS | Philly, Many, Medium]
    = P[Philly|PAYS] x P[Many|PAYS] x P[Medium|PAYS] x P[PAYS]
    = 1 x 0 x 0.5 x 0.5 = 0

As the conditional likelihood of being a defaulter is higher (because 0.25 > 0), we conclude that the new example is a defaulter.

*Note: We haven’t divided by P[Philly, Many, Medium] in the calculations above. Because the same denominator applies to both likelihoods, it does not affect which of the two is higher, so it does not affect our result.

Example 3
Now, assume a new example is presented where City=Philly, Children=Many, and Income=High.

First, we estimate the likelihood that the example is a defaulter, given its attribute values:
  P[Status = DEFAULTS | Philly, Many, High]
    = P[Philly|DEFAULTS] x P[Many|DEFAULTS] x P[High|DEFAULTS] x P[DEFAULTS]
    = 1 x 1 x 0 x 0.5 = 0

Then we estimate the likelihood that the example is a payer, given its attributes:
  P[Status = PAYS | Philly, Many, High]
    = P[Philly|PAYS] x P[Many|PAYS] x P[High|PAYS] x P[PAYS]
    = 1 x 0 x 0.5 x 0.5 = 0

As the conditional likelihood of being a defaulter is the same as that of being a payer (both are 0), we can come to no conclusion for this example.

Example 4
Take the following training data, for credit card authorizations:
[Table: ten transactions (TransactionID 1 to 10) with attributes Income (Very High, High, Medium, Low), Credit (Excellent, Good, Bad), and Decision (AUTHORIZE, REQUEST ID, REJECT, CALL POLICE)]
Source: Adapted from Dunham

Assume we’d like to determine how to classify a new transaction, with Income=Medium and Credit=Good.

Example 4
Our conditional probabilities are:
[Table: rows Income=Very High, Income=High, Income=Medium, Income=Low, Credit=Excellent, Credit=Good, Credit=Bad; columns "... given AUTHORIZE", "... given REQUEST ID", "... given REJECT", "... given CALL POLICE"; denominators are 6, 2, 1, and 1 respectively]

Our class probabilities are:
  P[Decision = AUTHORIZE] = 6/10
  P[Decision = REQUEST ID] = 2/10
  P[Decision = REJECT] = 1/10
  P[Decision = CALL POLICE] = 1/10

Example 4
Our goal is now to work out, for each class, the conditional probability of the new transaction (with Income=Medium & Credit=Good) being in that class. The class with the highest probability is the classification we choose.

Our conditional probabilities (again, ignoring the denominator in Bayes’ theorem) are:

  P[Decision = AUTHORIZE | Income=Medium & Credit=Good]
    = P[Income=Medium|Decision=AUTHORIZE] x P[Credit=Good|Decision=AUTHORIZE] x P[Decision=AUTHORIZE]
    = 2/6 x 3/6 x 6/10 = 36/360 = 0.1

  P[Decision = REQUEST ID | Income=Medium & Credit=Good]
    = P[Income=Medium|Decision=REQUEST ID] x P[Credit=Good|Decision=REQUEST ID] x P[Decision=REQUEST ID]
    = 1/2 x 0/2 x 2/10 = 0

  P[Decision = REJECT | Income=Medium & Credit=Good]
    = P[Income=Medium|Decision=REJECT] x P[Credit=Good|Decision=REJECT] x P[Decision=REJECT]
    = 0/1 x 0/1 x 1/10 = 0

  P[Decision = CALL POLICE | Income=Medium & Credit=Good]
    = P[Income=Medium|Decision=CALL POLICE] x P[Credit=Good|Decision=CALL POLICE] x P[Decision=CALL POLICE]

The highest of these probabilities is the first, so we conclude that the decision for our new transaction should be AUTHORIZE.

Example 5
- Let D = <unknown, low, none, 15-35>.
- Which risk category is D in?
- Three hypotheses: Risk=low, Risk=moderate, Risk=high.
- Because of the naïve assumption, calculate the individual probabilities and then multiply them together.

Example 5
Individual conditional probabilities:
  P(CH=unknown | Risk=low) = 2/5
  P(CH=unknown | Risk=moderate) = 1/3
  P(CH=unknown | Risk=high) = 2/6
  P(Debt=low | Risk=low) = 3/5
  P(Debt=low | Risk=moderate) = 1/3
  P(Debt=low | Risk=high) = 2/6
  P(Coll=none | Risk=low) = 3/5
  P(Coll=none | Risk=moderate) = 2/3
  P(Coll=none | Risk=high) = 6/6
  P(Inc=15-35 | Risk=low) = 0/5
  P(Inc=15-35 | Risk=moderate) = 2/3
  P(Inc=15-35 | Risk=high) = 2/6

Likelihoods:
  P(D|Risk=low) = 2/5 * 3/5 * 3/5 * 0/5 = 0
  P(D|Risk=moderate) = 1/3 * 1/3 * 2/3 * 2/3 = 4/81 = 0.0494
  P(D|Risk=high) = 2/6 * 2/6 * 6/6 * 2/6 = 48/1296 = 0.0370

Priors:
  P(Risk=low) = 5/14
  P(Risk=moderate) = 3/14
  P(Risk=high) = 6/14

Likelihood times prior:
  P(D|Risk=low) P(Risk=low) = 0 * 5/14 = 0
  P(D|Risk=moderate) P(Risk=moderate) = 4/81 * 3/14 = 0.0106
  P(D|Risk=high) P(Risk=high) = 48/1296 * 6/14 = 0.0159

So if we used ML (maximum likelihood), the answer would be Risk=moderate, but with MAP (maximum a posteriori), the answer is Risk=high.

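The difference between the ML and MAP decisions in Example 5 can be checked with a short sketch. The variable names are illustrative; the fractions are the conditional probabilities and priors listed for D = <unknown, low, none, 15-35>.

```python
from fractions import Fraction as F
from math import prod

# Per-attribute terms P(CH|Risk), P(Debt|Risk), P(Coll|Risk), P(Inc|Risk) for D
likelihood_terms = {
    "low":      [F(2, 5), F(3, 5), F(3, 5), F(0, 5)],
    "moderate": [F(1, 3), F(1, 3), F(2, 3), F(2, 3)],
    "high":     [F(2, 6), F(2, 6), F(6, 6), F(2, 6)],
}
prior = {"low": F(5, 14), "moderate": F(3, 14), "high": F(6, 14)}

# ML uses P(D|Risk) alone; MAP weights it by the prior P(Risk)
likelihood = {c: prod(terms) for c, terms in likelihood_terms.items()}
posterior_score = {c: likelihood[c] * prior[c] for c in prior}

ml_choice = max(likelihood, key=likelihood.get)
map_choice = max(posterior_score, key=posterior_score.get)

print("ML: ", ml_choice, {c: round(float(v), 4) for c, v in likelihood.items()})
print("MAP:", map_choice, {c: round(float(v), 4) for c, v in posterior_score.items()})
```

The two criteria disagree here because Risk=high is twice as common as Risk=moderate in the training data, which outweighs its somewhat smaller likelihood.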
Avoiding the 0-Probability Problem
- Naïve Bayesian prediction requires each conditional probability to be non-zero. Otherwise, the predicted probability (a product of the conditional probabilities) will be zero.
- Example: suppose a dataset with 1000 tuples, where income=low (0 tuples), income=medium (990 tuples), and income=high (10 tuples).
- Use the Laplacian correction (or Laplacian estimator): add 1 to each count.
  Prob(income = low) = 1/1003
  Prob(income = medium) = 991/1003
  Prob(income = high) = 11/1003
- The “corrected” probability estimates are close to their “uncorrected” counterparts.

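The Laplacian correction for the income example can be sketched as a small helper. The function name `laplace_smooth` and its parameter `k` (the pseudo-count, 1 in the slide) are illustrative assumptions, not part of the original material.

```python
from fractions import Fraction


def laplace_smooth(counts: dict, k: int = 1) -> dict:
    """Add k to every count and renormalize over all observed values."""
    total = sum(counts.values()) + k * len(counts)
    return {value: Fraction(c + k, total) for value, c in counts.items()}


# Counts from the slide: 1000 tuples split over three income values
income_counts = {"low": 0, "medium": 990, "high": 10}
smoothed = laplace_smooth(income_counts)

for value, p in smoothed.items():
    print(value, p, round(float(p), 4))
```

The denominator grows by one per distinct value (1000 + 3 = 1003), so the smoothed estimates still sum to 1 while no value is left with probability zero.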
Naïve Bayesian Classifier: Comments
- Advantages:
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages:
  - Assumes class conditional independence, which causes a loss of accuracy
  - In practice, dependencies exist among variables. E.g., for hospital patients:
    Profile: age, family history, etc.
    Symptoms: fever, cough, etc.
    Disease: lung cancer, diabetes, etc.
  - Dependencies among these cannot be modeled by a naïve Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

The independence hypothesis…
- … makes computation possible
- … yields optimal classifiers when satisfied
- … but is seldom satisfied in practice, as attributes (variables) are often correlated.
- Attempts to overcome this limitation:
  - Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  - Decision trees, which reason on one attribute at a time, considering the most important attributes first

Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent.
- It is a graphical model of causal relationships:
  - Represents dependency among the variables
  - Gives a specification of the joint probability distribution
[Figure: directed acyclic graph over nodes X, Y, Z, P]
- Nodes: random variables
- Links: dependency
- X and Y are the parents of Z, and Y is the parent of P
- There is no dependency between Z and P
- The graph has no loops or cycles

Bayesian Belief Network: An Example
[Figure: network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea]

The conditional probability table (CPT) for variable LungCancer (LC), with parents FamilyHistory (FH) and Smoker (S):

          (FH, S)    (FH, ~S)    (~FH, S)    (~FH, ~S)
  LC      0.8        0.5         0.7         0.1
  ~LC     0.2        0.5         0.3         0.9

- The CPT shows the conditional probability of the variable for each possible combination of values of its parents.
- Derivation of the probability of a particular combination of values of X = (x1, …, xn) from the CPTs:

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(Y_i))$$

  where Parents(Y_i) denotes the parents of the node Y_i that takes the value x_i.

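The product-of-CPT-entries rule can be illustrated on the LungCancer fragment of the network. This is a sketch under stated assumptions: it treats FamilyHistory and Smoker as root nodes, and the priors `p_fh` and `p_s` are hypothetical values that do not come from the slides. Only the LungCancer CPT is taken from the example.

```python
# P(LC = yes | FH, S), from the LungCancer CPT; keys are (FH, S)
p_lc_given = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# ASSUMED priors for the parent nodes (hypothetical, for illustration only)
p_fh = 0.1  # P(FamilyHistory = yes)
p_s = 0.3   # P(Smoker = yes)


def joint(fh: bool, s: bool, lc: bool) -> float:
    """P(FH=fh, S=s, LC=lc) = P(fh) * P(s) * P(lc | fh, s)."""
    fh_term = p_fh if fh else 1.0 - p_fh
    s_term = p_s if s else 1.0 - p_s
    lc_yes = p_lc_given[(fh, s)]
    lc_term = lc_yes if lc else 1.0 - lc_yes
    return fh_term * s_term * lc_term


# Marginal P(LC = yes), obtained by summing the joint over both parents
p_lc = sum(joint(fh, s, True) for fh in (True, False) for s in (True, False))
print(round(joint(True, True, True), 4), round(p_lc, 4))
```

Each factor in `joint` is one CPT lookup, so the full joint distribution never has to be stored explicitly.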
Training Bayesian Networks
Several scenarios:
- Network structure given and all variables observable: learn only the CPTs.
- Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning.
- Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
- Unknown structure, all hidden variables: no good algorithms are known for this purpose.
Ref.: D. Heckerman, Bayesian networks for data mining