Data Mining: Concepts and Techniques


Overview
Bayesian Probability
Bayes' Rule
Naïve Bayesian Classification
February 5, 2018 Data Mining: Concepts and Techniques

Probability
Let P(A) represent the probability that proposition A is true.
Example: let Risky represent that a customer is a high credit risk. P(Risky) = 0.519 means that there is a 51.9% chance that a given customer is a high credit risk.
Without any other information, this probability is called the prior or unconditional probability.
Propositions are logical statements without quantifiers (for all, there exists) that are either true or false. For the purposes of this course, we will not consider probabilities for first-order logic statements.

Random Variables
We can also consider a random variable X, which takes on one of the values <x1, x2, …, xn> in its domain.
Example: let Weather be a random variable with domain <sunny, rain, cloudy, snow>. The probabilities of Weather taking on each of these values are:
P(Weather=sunny) = 0.7
P(Weather=rain) = 0.2
P(Weather=cloudy) = 0.08
P(Weather=snow) = 0.02
Note that the probabilities over the domain sum to 1.

Conditional Probability
Probabilities of events change when we know something about the world.
The notation P(A|B) is used for the conditional or posterior probability of A, read "the probability of A given that all we know is B."
Example: P(Weather = snow | Temperature = below freezing) = 0.10 is read "there is a 10% chance of snow given that we know the temperature is below freezing."

Axioms of Probability
All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1.
Necessarily true propositions have probability 1, necessarily false propositions probability 0: P(true) = 1, P(false) = 0.
The probability of a disjunction is given by P(A ∨ B) = P(A) + P(B) − P(A ∧ B).
Some examples of using the axioms are done on the board, for example showing P(A | B ∧ A) = 1.
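These axioms can be checked mechanically on a tiny finite world. The sketch below uses two hypothetical events A and B over four equally likely states (not taken from the slides) to verify the disjunction rule:

```python
# Checking the probability axioms on a toy world of equally likely states.
# The events A and B below are illustrative, not from the slides.

states = {"s1", "s2", "s3", "s4"}   # four equally likely world states
A = {"s1", "s2"}                    # event A holds in states s1, s2
B = {"s2", "s3"}                    # event B holds in states s2, s3

def p(event):
    # Probability of an event = fraction of world states where it holds.
    return len(event) / len(states)

# Axiom: every probability lies between 0 and 1.
assert 0 <= p(A) <= 1

# Disjunction rule: P(A or B) = P(A) + P(B) - P(A and B).
assert p(A | B) == p(A) + p(B) - p(A & B)   # 0.75 == 0.5 + 0.5 - 0.25
```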

Axioms of Probability (continued)
We can use logical connectives inside probabilities, e.g. P(Weather = snow ∧ Temperature = below freezing); disjunction (or) and negation (not) work as well.
The product rule: P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A).
Note that a conjunction is not the same as a conditional probability; the relationship between them is exactly the product rule above. The conditional probability notation allows for more conciseness.

Bayes Theorem - 1
Consider a Venn diagram in which the area of the enclosing rectangle is 1, region A has area P(A), region B has area P(B), and their overlap has area P(A ∧ B).
P(A|B) means "the probability of observing event A given that event B has already been observed", i.e., how much of the time that we see B do we also see A (the ratio of the overlap region to region B).
P(A|B) = P(A ∧ B)/P(B), and also P(B|A) = P(A ∧ B)/P(A); therefore P(A|B) = P(B|A)P(A)/P(B) (Bayes formula for two events).
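The Venn-diagram argument translates directly into arithmetic. A minimal numeric check, with made-up areas for illustration:

```python
# Bayes' rule from overlap areas, as in the Venn-diagram argument:
# P(A|B) = P(A and B) / P(B), so P(A|B) = P(B|A) * P(A) / P(B).
# The numbers below are made up for illustration.

p_a_and_b = 0.12    # area of the overlap region
p_a = 0.30          # area of region A
p_b = 0.40          # area of region B

p_a_given_b = p_a_and_b / p_b    # direct ratio definition
p_b_given_a = p_a_and_b / p_a

# Bayes' formula recovers the same conditional probability.
via_bayes = p_b_given_a * p_a / p_b
assert abs(via_bayes - p_a_given_b) < 1e-12
```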

Bayes Theorem - 2
More formally, let X be the sample data (evidence) and let H be the hypothesis that X belongs to class C.
In classification problems we wish to determine the probability that H holds given the observed sample data X, i.e., we seek P(H|X), which is known as the posterior probability of H conditioned on X.

Bayes Theorem - 3
P(H) is the prior probability of H. Similarly, P(X|H) is the posterior probability of X conditioned on H.
Bayes Theorem (from the earlier slide) is then
P(H|X) = P(X|H) P(H) / P(X)

Chapter 7. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Classification by backpropagation
Classification based on concepts from association rule mining
Other classification methods
Prediction
Classification accuracy
Summary

Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
Foundation: based on Bayes' Theorem.
Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable with decision tree and selected neural network classifiers.
Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Bayes Theorem for Classification
Given training data D, the posterior probability of a hypothesis h, P(h|D), follows from Bayes theorem:
P(h|D) = P(D|h) P(h) / P(D)
The MAP (maximum a posteriori) hypothesis is the h that maximizes P(h|D).
Predict that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.
Practical difficulty: this requires initial knowledge of many probabilities, at significant computational cost.

Towards the Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, with each tuple represented by an n-dimensional attribute vector X = (x1, x2, …, xn).
Suppose there are m classes C1, C2, …, Cm. Classification derives the maximum posterior, i.e., the maximal P(Ci|X). This can be computed from Bayes' theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
Once the probabilities P(X|Ci) are known, assign X to the class with maximum P(X|Ci) P(Ci).

Derivation of the Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes):
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
This greatly reduces the computation cost: only the class distribution needs to be counted.
If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D).
If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²))
and P(xk|Ci) = g(xk, μCi, σCi), where μCi and σCi are estimated from the class-Ci tuples.
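For a continuous attribute, the Gaussian estimate can be sketched as below. The attribute (age), the sample values, and the class are hypothetical, used only to show the fit-then-evaluate pattern:

```python
import math

# Gaussian estimate of P(xk|Ci) for a continuous attribute:
# fit mu and sigma to the attribute values of the class-Ci tuples,
# then evaluate the density at the query value.
def gaussian(x, mu, sigma):
    # The density g(x, mu, sigma) used as P(xk|Ci).
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

ages_in_class = [25, 28, 30, 35, 32]   # ages of tuples in class Ci (made up)
mu = sum(ages_in_class) / len(ages_in_class)
sigma = math.sqrt(sum((a - mu) ** 2 for a in ages_in_class) / len(ages_in_class))

p = gaussian(30, mu, sigma)   # P(age=30 | Ci), a density value
```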

Bayesian classification
The classification problem may be formalized using posterior probabilities: P(C|X) = the probability that the sample tuple X = <x1, …, xk> is of class C. E.g., P(class=N | outlook=sunny, windy=true, …).
Idea: assign to sample X the class label C such that P(C|X) is maximal.

Play-tennis example: estimating P(xi|C)
outlook: P(sunny|p) = 2/9, P(sunny|n) = 3/5; P(overcast|p) = 4/9, P(overcast|n) = 0; P(rain|p) = 3/9, P(rain|n) = 2/5
temperature: P(hot|p) = 2/9, P(hot|n) = 2/5; P(mild|p) = 4/9, P(mild|n) = 2/5; P(cool|p) = 3/9, P(cool|n) = 1/5
humidity: P(high|p) = 3/9, P(high|n) = 4/5; P(normal|p) = 6/9, P(normal|n) = 1/5
windy: P(true|p) = 3/9, P(true|n) = 3/5; P(false|p) = 6/9, P(false|n) = 2/5
Priors: P(p) = 9/14, P(n) = 5/14

Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>:
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
Sample X is classified in class n (don't play).
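The two products above can be reproduced exactly using rational arithmetic:

```python
# Reproducing the play-tennis calculation with the estimated probabilities,
# using fractions to avoid any rounding.
from fractions import Fraction as F

# P(rain|p) * P(hot|p) * P(high|p) * P(false|p) * P(p)
score_p = F(3, 9) * F(2, 9) * F(3, 9) * F(6, 9) * F(9, 14)
# P(rain|n) * P(hot|n) * P(high|n) * P(false|n) * P(n)
score_n = F(2, 5) * F(2, 5) * F(4, 5) * F(2, 5) * F(5, 14)

print(float(score_p))                      # ≈ 0.010582
print(float(score_n))                      # ≈ 0.018286
print("n" if score_n > score_p else "p")   # n: don't play
```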

Numeric Attributes

Training dataset
Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class:
P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
For X = (age<=30, income=medium, student=yes, credit_rating=fair):
P(X|buys_computer="yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer="no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|Ci)·P(Ci):
P(X|buys_computer="yes") · P(buys_computer="yes") = 0.044 × 9/14 = 0.028
P(X|buys_computer="no") · P(buys_computer="no") = 0.019 × 5/14 = 0.007
X belongs to class buys_computer = "yes".

Example 3
Take the following training data, from bank loan applicants:
ApplicantID | City   | Children | Income | Status
1           | Philly | Many     | Low    | DEFAULTS
2           | Philly | Many     | Medium | DEFAULTS
3           | Philly | Few      | Medium | PAYS
4           | Philly | Few      | High   | PAYS
As our attributes are all categorical in this case, we obtain our probabilities using simple counts and ratios:
P[City=Philly | Status=DEFAULTS] = 2/2 = 1
P[City=Philly | Status=PAYS] = 2/2 = 1
P[Children=Many | Status=DEFAULTS] = 2/2 = 1
P[Children=Few | Status=DEFAULTS] = 0/2 = 0
etc.

Example 3 (continued)
Summarizing, we have the following probabilities:
Probability of...  | ...given DEFAULTS | ...given PAYS
City=Philly        | 2/2 = 1           | 2/2 = 1
Children=Few       | 0/2 = 0           | 2/2 = 1
Children=Many      | 2/2 = 1           | 0/2 = 0
Income=Low         | 1/2 = 0.5         | 0/2 = 0
Income=Medium      | 1/2 = 0.5         | 1/2 = 0.5
Income=High        | 0/2 = 0           | 1/2 = 0.5
and the class priors P[Status=DEFAULTS] = 2/4 = 0.5 and P[Status=PAYS] = 2/4 = 0.5.
For instance, the probability of Income=Medium given that the applicant DEFAULTs is the number of applicants with Income=Medium who DEFAULT divided by the number of applicants who DEFAULT = 1/2 = 0.5.

Example 3 (continued)
Now assume a new example is presented where City=Philly, Children=Many, and Income=Medium.
First, we estimate the likelihood that the example is a defaulter, given its attribute values, using P[H1|E] ∝ P[E|H1]·P[H1] (denominator omitted*):
P[Status=DEFAULTS | Philly, Many, Medium] ∝ P[Philly|DEFAULTS] × P[Many|DEFAULTS] × P[Medium|DEFAULTS] × P[DEFAULTS] = 1 × 1 × 0.5 × 0.5 = 0.25
Then we estimate the likelihood that the example is a payer, given its attributes, using P[H2|E] ∝ P[E|H2]·P[H2] (denominator omitted*):
P[Status=PAYS | Philly, Many, Medium] ∝ P[Philly|PAYS] × P[Many|PAYS] × P[Medium|PAYS] × P[PAYS] = 1 × 0 × 0.5 × 0.5 = 0
As the conditional likelihood of being a defaulter is higher (0.25 > 0), we conclude that the new example is a defaulter.
*Note: we haven't divided by P[Philly, Many, Medium] in the calculations above. That doesn't affect which of the two likelihoods is higher, since it is applied to both, so it doesn't affect our result.
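The whole example can be packaged as a tiny count-based naïve Bayes, trained on the four applicants. This is a sketch with no smoothing, and the shared denominator is omitted just as on the slide:

```python
# Minimal count-based naive Bayes for the four loan applicants.
from collections import Counter, defaultdict

data = [  # (City, Children, Income, Status)
    ("Philly", "Many", "Low",    "DEFAULTS"),
    ("Philly", "Many", "Medium", "DEFAULTS"),
    ("Philly", "Few",  "Medium", "PAYS"),
    ("Philly", "Few",  "High",   "PAYS"),
]

class_counts = Counter(row[-1] for row in data)
cond_counts = defaultdict(Counter)   # class -> Counter of (attr index, value)
for *attrs, label in data:
    for i, v in enumerate(attrs):
        cond_counts[label][(i, v)] += 1

def score(attrs, label):
    # P[attrs|label] * P[label] under the naive independence assumption.
    s = class_counts[label] / len(data)            # prior P[label]
    for i, v in enumerate(attrs):
        s *= cond_counts[label][(i, v)] / class_counts[label]
    return s

example = ("Philly", "Many", "Medium")
print(score(example, "DEFAULTS"))   # 0.25
print(score(example, "PAYS"))       # 0.0
```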

Example 3 (continued)
Now assume a new example is presented where City=Philly, Children=Many, and Income=High.
First, we estimate the likelihood that the example is a defaulter, given its attribute values:
P[Status=DEFAULTS | Philly, Many, High] ∝ P[Philly|DEFAULTS] × P[Many|DEFAULTS] × P[High|DEFAULTS] × P[DEFAULTS] = 1 × 1 × 0 × 0.5 = 0
Then we estimate the likelihood that the example is a payer, given its attributes:
P[Status=PAYS | Philly, Many, High] ∝ P[Philly|PAYS] × P[Many|PAYS] × P[High|PAYS] × P[PAYS] = 1 × 0 × 0.5 × 0.5 = 0
As the conditional likelihood of being a defaulter is the same as that of being a payer (both are zero), we can come to no conclusion for this example.

Example 4
Take the following training data, for credit card authorizations (source: adapted from Dunham):
TransactionID | Income    | Credit    | Decision
1             | Very High | Excellent | AUTHORIZE
2             | High      | Good      | AUTHORIZE
3             | Medium    | Excellent | AUTHORIZE
4             | High      | Good      | AUTHORIZE
5             | Very High | Good      | AUTHORIZE
6             | Medium    | Excellent | AUTHORIZE
7             | High      | Bad       | REQUEST ID
8             | Medium    | Bad       | REQUEST ID
9             | High      | Bad       | REJECT
10            | Low       | Bad       | CALL POLICE
Assume we'd like to determine how to classify a new transaction, with Income=Medium and Credit=Good.

Example 4 (continued)
Our conditional probabilities are:
Probability of...  | AUTHORIZE | REQUEST ID | REJECT | CALL POLICE
Income=Very High   | 2/6       | 0/2        | 0/1    | 0/1
Income=High        | 2/6       | 1/2        | 1/1    | 0/1
Income=Medium      | 2/6       | 1/2        | 0/1    | 0/1
Income=Low         | 0/6       | 0/2        | 0/1    | 1/1
Credit=Excellent   | 3/6       | 0/2        | 0/1    | 0/1
Credit=Good        | 3/6       | 0/2        | 0/1    | 0/1
Credit=Bad         | 0/6       | 2/2        | 1/1    | 1/1
Our class probabilities are:
P[Decision=AUTHORIZE] = 6/10
P[Decision=REQUEST ID] = 2/10
P[Decision=REJECT] = 1/10
P[Decision=CALL POLICE] = 1/10

Example 4 (continued)
Our goal is now to work out, for each class, the conditional probability of the new transaction (with Income=Medium and Credit=Good) being in that class; the class with the highest probability is the classification we choose. Our conditional probabilities (again ignoring Bayes' denominator) are:
P[Decision=AUTHORIZE | Income=Medium & Credit=Good] ∝ P[Income=Medium|Decision=AUTHORIZE] × P[Credit=Good|Decision=AUTHORIZE] × P[Decision=AUTHORIZE] = 2/6 × 3/6 × 6/10 = 36/360 = 0.1
P[Decision=REQUEST ID | Income=Medium & Credit=Good] ∝ P[Income=Medium|Decision=REQUEST ID] × P[Credit=Good|Decision=REQUEST ID] × P[Decision=REQUEST ID] = 1/2 × 0/2 × 2/10 = 0
P[Decision=REJECT | Income=Medium & Credit=Good] ∝ P[Income=Medium|Decision=REJECT] × P[Credit=Good|Decision=REJECT] × P[Decision=REJECT] = 0/1 × 0/1 × 1/10 = 0
P[Decision=CALL POLICE | Income=Medium & Credit=Good] ∝ P[Income=Medium|Decision=CALL POLICE] × P[Credit=Good|Decision=CALL POLICE] × P[Decision=CALL POLICE] = 0/1 × 0/1 × 1/10 = 0
The highest of these probabilities is the first, so we conclude that the decision for our new transaction should be AUTHORIZE.
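The same computation can be run for all four classes, trained directly from the ten training transactions. This is a sketch; the denominator P[Income=Medium & Credit=Good] is again omitted:

```python
from collections import Counter

# Training transactions as (Income, Credit, Decision).
data = [
    ("Very High", "Excellent", "AUTHORIZE"),
    ("High",      "Good",      "AUTHORIZE"),
    ("Medium",    "Excellent", "AUTHORIZE"),
    ("High",      "Good",      "AUTHORIZE"),
    ("Very High", "Good",      "AUTHORIZE"),
    ("Medium",    "Excellent", "AUTHORIZE"),
    ("High",      "Bad",       "REQUEST ID"),
    ("Medium",    "Bad",       "REQUEST ID"),
    ("High",      "Bad",       "REJECT"),
    ("Low",       "Bad",       "CALL POLICE"),
]

class_counts = Counter(decision for _, _, decision in data)

def score(income, credit, decision):
    # P[Income|Decision] * P[Credit|Decision] * P[Decision]
    n = class_counts[decision]
    p_income = sum(1 for i, _, d in data if d == decision and i == income) / n
    p_credit = sum(1 for _, c, d in data if d == decision and c == credit) / n
    return p_income * p_credit * n / len(data)

scores = {d: score("Medium", "Good", d) for d in class_counts}
print(max(scores, key=scores.get))   # AUTHORIZE
```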

Example 5
The training data (shown as a table in the original slides) consists of 14 loan applicants described by Credit History, Debt, Collateral, and Income, each labeled with a Risk class (low, moderate, or high).

Example 5 (continued)
Let D = <unknown, low, none, 15-35>, giving values for Credit History, Debt, Collateral, and Income. Which risk category is D in?
Three hypotheses: Risk=low, Risk=moderate, Risk=high.
Because of the naïve independence assumption, we calculate the individual conditional probabilities and then multiply them together.

Example 5 (continued)
Conditional probabilities:
P(CH=unknown | Risk=low) = 2/5; P(CH=unknown | Risk=moderate) = 1/3; P(CH=unknown | Risk=high) = 2/6
P(Debt=low | Risk=low) = 3/5; P(Debt=low | Risk=moderate) = 1/3; P(Debt=low | Risk=high) = 2/6
P(Coll=none | Risk=low) = 3/5; P(Coll=none | Risk=moderate) = 2/3; P(Coll=none | Risk=high) = 6/6
P(Inc=15-35 | Risk=low) = 0/5; P(Inc=15-35 | Risk=moderate) = 2/3; P(Inc=15-35 | Risk=high) = 2/6
Priors: P(Risk=low) = 5/14, P(Risk=moderate) = 3/14, P(Risk=high) = 6/14
Likelihoods:
P(D|Risk=low) = 2/5 × 3/5 × 3/5 × 0/5 = 0
P(D|Risk=moderate) = 1/3 × 1/3 × 2/3 × 2/3 = 4/81 ≈ 0.0494
P(D|Risk=high) = 2/6 × 2/6 × 6/6 × 2/6 = 48/1296 ≈ 0.037
Unnormalized posteriors:
P(D|Risk=low)·P(Risk=low) = 0 × 5/14 = 0
P(D|Risk=moderate)·P(Risk=moderate) = 4/81 × 3/14 ≈ 0.0106
P(D|Risk=high)·P(Risk=high) = 48/1296 × 6/14 ≈ 0.0159
So if we used ML (maximum likelihood), the answer would be Risk=moderate, but with MAP the answer is Risk=high.

Avoiding the 0-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero.
Example: suppose a dataset with 1000 tuples has income=low (0 tuples), income=medium (990), and income=high (10).
Use the Laplacian correction (or Laplace estimator): add 1 to each case:
Prob(income=low) = 1/1003
Prob(income=medium) = 991/1003
Prob(income=high) = 11/1003
The "corrected" probability estimates are close to their "uncorrected" counterparts.
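The correction on the income counts from this slide, sketched in code; with three distinct values, the denominator grows by 3, from 1000 to 1003:

```python
# Laplacian correction: add 1 to each count; the denominator grows by the
# number of distinct attribute values (3 here), giving 1000 + 3 = 1003.
counts = {"low": 0, "medium": 990, "high": 10}

total = sum(counts.values())                       # 1000 tuples
corrected = {v: (c + 1) / (total + len(counts))    # (count + 1) / 1003
             for v, c in counts.items()}

print(corrected["low"])      # 1/1003   ≈ 0.000997
print(corrected["medium"])   # 991/1003 ≈ 0.988036
print(corrected["high"])     # 11/1003  ≈ 0.010967
```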

Naïve Bayesian Classifier: Comments
Advantages:
Easy to implement
Good results obtained in most cases
Disadvantages:
The assumption of class-conditional independence causes a loss of accuracy, because in practice dependencies exist among variables.
E.g., in hospital data a patient has a profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.); dependencies among these cannot be modeled by the naïve Bayesian classifier.
How to deal with these dependencies? Bayesian belief networks.

The independence hypothesis...
...makes computation possible
...yields optimal classifiers when satisfied
...but is seldom satisfied in practice, as attributes (variables) are often correlated.
Attempts to overcome this limitation:
Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
Decision trees, which reason on one attribute at a time, considering the most important attributes first

Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent.
It is a graphical model of causal relationships: it represents dependencies among the variables and gives a specification of the joint probability distribution.
Nodes are random variables; links represent dependencies; the graph has no loops or cycles.
Example: X and Y are the parents of Z, and Y is the parent of P; there is no direct dependency between Z and P.

Bayesian Belief Network: An Example
Consider a network with variables FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea, where FH and S are the parents of LC.
The conditional probability table (CPT) for the variable LungCancer:
     | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S)
LC   | 0.8     | 0.5      | 0.7      | 0.1
~LC  | 0.2     | 0.5      | 0.3      | 0.9
The CPT shows the conditional probability for each possible combination of values of the node's parents.
The probability of a particular combination of values (x1, …, xn) of X is derived from the CPTs as:
P(x1, …, xn) = ∏i P(xi | Parents(Xi))
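The factorization can be sketched numerically for the FH, S, LC fragment: P(FH, S, LC) = P(FH) · P(S) · P(LC | FH, S). The CPT values below come from the slide, but the priors for FamilyHistory and Smoker are not given there and are assumed purely for illustration:

```python
# Joint probability from a belief network: multiply each variable's
# probability conditioned on its parents.
p_fh = 0.1          # ASSUMED prior P(FamilyHistory) -- not on the slide
p_s = 0.3           # ASSUMED prior P(Smoker) -- not on the slide
cpt_lc = {          # CPT for LungCancer from the slide: P(LC | FH, S)
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# P(FH=true, S=true, LC=true) = P(FH) * P(S) * P(LC | FH, S)
joint = p_fh * p_s * cpt_lc[(True, True)]
print(joint)   # ≈ 0.024
```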

Training Bayesian Networks
Several scenarios:
Given both the network structure and all variables observable: learn only the CPTs.
Network structure known, some variables hidden: use a gradient descent (greedy hill-climbing) method, analogous to neural network learning.
Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
Structure unknown, all variables hidden: no good algorithms are known for this purpose.
Ref.: D. Heckerman, Bayesian networks for data mining.