1
Intro. to Data Mining Chapter 6. Bayesian
2
What Is Classification?
[Figure: classification workflow. Training instances are used in model learning; the resulting model is applied to test instances in the prediction step, labeling each instance as positive or negative.]
3
Typical Classification Methods
[Figures: examples of classifiers. A decision tree branching on age (<=30, 31..40, >40), student (yes/no), and credit rating (fair/excellent); a Bayesian network over Family History, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; plus Support Vector Machines, Neural Networks, and many more.]
4
Pattern-Based Classification, Why?
Pattern-based classification: an integration of two themes, frequent pattern mining and classification.
Why pattern-based classification?
- Feature construction: higher-order, compact, discriminative features, e.g., single word → phrase (Apple pie, Apple iPad). A single feature is often not enough.
- Complex data modeling: graphs (no predefined feature vectors), sequences, semi-structured/unstructured data. Complex data is difficult to handle with plain feature vectors.
5
Pattern-Based Classification on Graphs
Use frequent patterns as features for classification.
[Figure: graph transactions labeled Active/Inactive are mined for frequent subgraphs (e.g., g1 and g2 with min_sup = 2); each graph is then transformed into a feature vector indicating which frequent subgraphs it contains, as sketched below.]
Related work: the approach is not confined to rule-based methods; one can select the most discriminative features and use any classifier (e.g., emerging patterns).
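A minimal sketch of the pattern-as-feature idea, assuming the frequent patterns have already been mined and that a hypothetical contains(graph, pattern) subgraph test is available; scikit-learn's LogisticRegression stands in for "any classifier" (none of these names come from the slides):

```python
from sklearn.linear_model import LogisticRegression

def to_feature_vector(graph, patterns, contains):
    # One binary feature per frequent pattern: 1 if the pattern occurs in the graph.
    return [1 if contains(graph, p) else 0 for p in patterns]

def train_pattern_classifier(graphs, labels, patterns, contains):
    # Transform each graph into a pattern-indicator vector, then fit any classifier.
    X = [to_feature_vector(g, patterns, contains) for g in graphs]
    clf = LogisticRegression()
    clf.fit(X, labels)
    return clf
```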
6
Discrete Random Variables
A discrete random variable X takes values in a finite set of possible outcomes. E.g., X binary: X ∈ {0, 1}, with P(X = 1) = 1 − P(X = 0).
7
Continuous Random Variable
Probability distribution (density function) over continuous values
8
Conditional probability
9
Mutually exclusive / independence
10
Joint / marginal probability
11
Example
12
Bayes Rule
Uses the prior probability of each category, given no information about an item. Categorization produces a posterior (conditional) probability distribution over the possible categories, given a description of an item.
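For reference, Bayes' rule in the form used for classification (a standard identity, stated here for completeness):

P(Ci | X) = P(X | Ci) · P(Ci) / P(X)

We predict the class Ci that maximizes P(X | Ci) · P(Ci), since P(X) is the same for all classes.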
13
Naïve Bayes Classifier: Training Dataset
Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data to be classified: X = (age <= 30, Income = medium, Student = yes, Credit_rating = fair)
14
Naïve Bayes Classifier: another calculation example
P(Ci):
P(buys_computer = "yes") = 9/14 = 0.643
P(buys_computer = "no") = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
Therefore, X belongs to class "buys_computer = yes".
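A small Python sketch reproducing the computation above (the counts are the ones listed on the slide; the code itself is illustrative, not part of the original deck):

```python
# Class priors and per-attribute conditional probabilities from the slide.
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit=fair": 6 / 9},
    "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit=fair": 2 / 5},
}

x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c in priors:
    # Naive Bayes: P(X|C) * P(C), assuming attribute independence given the class.
    p = priors[c]
    for attr in x:
        p *= cond[c][attr]
    scores[c] = p

print(scores)                       # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))  # 'yes'
```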
15
Naive Bayes
16
Naive Bayes example
17
Naive Bayes example
18
Naive Bayes example
19
Different types of variables
20
Discrete variables
21
Continuous variables
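A common way to handle a continuous attribute in Naïve Bayes (the usual textbook choice; whether this slide uses exactly this form is an assumption) is a class-conditional Gaussian estimated from the training tuples of each class:

P(x | Ci) = g(x, μCi, σCi) = (1 / (√(2π) · σCi)) · exp( −(x − μCi)² / (2 σCi²) )

where μCi and σCi are the mean and standard deviation of the attribute within class Ci.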
22
Continuous variables example
23
Bayes example
24
Bayes classifier example
25
Bayes classifier example
26
Bayes classifier example
27
Bayes classifier example
28
Bayes classifier with several features
29
Bayes classifier with several features
30
Language model
How to compute a joint probability such as P(its, water, is, so, transparent, that)?
Recall the definition of conditional probability: P(B|A) = P(A,B) / P(A)
Rewriting: P(A,B) = P(A) P(B|A)
More variables (the chain rule): P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
Applied to the sentence:
P("its water is so transparent that") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so) × P(that | its water is so transparent)
31
N-gram models
Example mini-corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
In general this is an insufficient model of language, because language has long-distance dependencies, but n-gram models often work well in practice (a bigram sketch on this corpus follows below).
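A minimal sketch of bigram maximum-likelihood estimation on this toy corpus (illustrative code, not from the slides):

```python
from collections import Counter

# Bigram maximum-likelihood estimates from the toy corpus above.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """MLE estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<s>"))   # 2/3
print(p("am", "I"))    # 2/3
print(p("Sam", "am"))  # 1/2
```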
32
Naive Bayes Example
33
Discussion of Bayes'
34
Discussion of Bayes'
35
Example of Bayes'
36
Laplace estimator
37
Laplace estimator
38
Laplace estimator
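In its standard form (the slide's own notation may differ), the Laplace (add-one) estimator for an attribute with k possible values is:

P(attribute = a | Ci) = (count(a, Ci) + 1) / (count(Ci) + k)

This prevents a single unseen attribute value from forcing the whole product P(X | Ci) to zero.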
39
M-estimate
40
M-estimate
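In the same notation, the m-estimate generalizes the Laplace correction with a prior estimate p and an equivalent sample size m (standard form, stated for completeness):

P(attribute = a | Ci) = (count(a, Ci) + m·p) / (count(Ci) + m)

Choosing p = 1/k and m = k recovers the Laplace estimator.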
41
M-estimator example
42
Naïve Bayes Classifier: Comments
Advantages:
- Easy to implement
- Good results obtained in most of the cases
Disadvantages:
- Assumes class conditional independence, which causes a loss of accuracy
- In practice, dependencies exist among variables. E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by a Naïve Bayes classifier.
How to deal with these dependencies? Bayesian Belief Networks.
43
Discussion of Bayes'
44
Bayesian Belief Networks
Bayesian belief networks (also known as Bayesian networks or probabilistic networks) allow class conditional independencies between subsets of variables.
A (directed acyclic) graphical model of causal relationships:
- Represents dependency among the variables
- Gives a specification of the joint probability distribution
Nodes: random variables. Links: dependencies. The graph has no loops/cycles.
[Figure: a small network in which X and Y are the parents of Z, Y is the parent of P, and there is no dependency between Z and P.]
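The defining property of such a network (a standard fact, stated for completeness) is that the joint distribution factorizes according to the graph:

P(x1, …, xn) = ∏i P(xi | Parents(Xi))

For the small network above, assuming X and Y themselves have no parents, this gives P(x, y, z, p) = P(x) · P(y) · P(z | x, y) · P(p | y).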
45
Bayesian Belief Networks
46
Bayesian Belief Networks
47
Examples of 3-way Bayesian Networks
Marginal independence (no edges between A, B, C): p(A,B,C) = p(A) p(B) p(C)
Conditionally independent effects (A → B, A → C): p(A,B,C) = p(B|A) p(C|A) p(A). B and C are conditionally independent given A; e.g., A is a disease, and we model B and C as conditionally independent symptoms given A.
48
Examples of 3-way Bayesian Networks
Independent causes (A → C ← B): p(A,B,C) = p(C|A,B) p(A) p(B). The "explaining away" effect: given C, observing A makes B less likely, e.g., the earthquake/burglary/alarm example. A and B are (marginally) independent, but become dependent once C is known (a numerical sketch follows below).
Markov dependence (A → B → C): p(A,B,C) = p(C|B) p(B|A) p(A)
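A numerical sketch of the explaining-away effect in the independent-causes structure, using the burglary/earthquake/alarm naming from the slide; the probability values below are illustrative choices, not taken from the deck:

```python
# Structure: Burglary -> Alarm <- Earthquake (illustrative numbers).
p_b, p_e = 0.001, 0.002                       # P(Burglary), P(Earthquake)
p_alarm = {(0, 0): 0.001, (0, 1): 0.29,       # P(Alarm=1 | Burglary, Earthquake)
           (1, 0): 0.94,  (1, 1): 0.95}

def joint(b, e, a):
    """P(B=b, E=e, A=a) = P(A|B,E) * P(B) * P(E)."""
    pa = p_alarm[(b, e)] if a == 1 else 1.0 - p_alarm[(b, e)]
    return pa * (p_b if b else 1 - p_b) * (p_e if e else 1 - p_e)

# P(Burglary=1 | Alarm=1): marginalize over Earthquake.
den = sum(joint(b, e, 1) for b in (0, 1) for e in (0, 1))
print(sum(joint(1, e, 1) for e in (0, 1)) / den)              # ~0.37

# P(Burglary=1 | Alarm=1, Earthquake=1): the earthquake "explains away"
# the alarm, so burglary becomes much less likely.
print(joint(1, 1, 1) / sum(joint(b, 1, 1) for b in (0, 1)))   # ~0.003
```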
49
Bayesian Belief Networks
50
Bayesian Belief Networks example
51
Bayesian Belief Networks example
52
Discussion of Bayesian Belief Networks
53
Conditional Independence
A variable (node) is conditionally independent of its non-descendants given its parents.
[Figure: a network over Age, Gender, Exposure to Toxics, Smoking, Cancer, Serum Calcium, and Lung Tumor. Age and Gender are non-descendants of Cancer; Exposure to Toxics and Smoking are its parents; Serum Calcium and Lung Tumor are its descendants.]
Example: Cancer is independent of Age and Gender given Exposure to Toxics and Smoking.
54
The learning task
Input: training data, i.e., cases over the variables (in the example: Burglary, Earthquake, Alarm, Call, Newscast).
Output: a Bayesian network modeling the data.
Key questions: are the data cases fully or partially observable? Do we learn the parameters only, or also the structure?
[Figure: example data cases (b, e, a, c, n) and the learned alarm network.]
55
Structure learning
Goal: find a "good" BN structure (relative to the data).
Solution: do heuristic search over space of network structures.
56
Search space
Space = network structures
Operators = add/reverse/delete edges
57
Heuristic search
Use a scoring function to do heuristic search (any algorithm). Greedy hill-climbing with randomness works pretty well (a sketch follows below).
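A high-level sketch of greedy hill-climbing with random restarts over network structures; random_start, neighbors, and score are hypothetical callables (e.g., a random-DAG generator, an edge add/delete/reverse move generator, and a BIC scorer), not functions defined in the slides:

```python
def hill_climb(data, random_start, neighbors, score, restarts=5, max_steps=100):
    """Return the best-scoring structure found over several random restarts."""
    best, best_score = None, float("-inf")
    for _ in range(restarts):
        current = random_start()               # random initial DAG
        current_score = score(current, data)
        for _ in range(max_steps):
            # Candidate structures reachable by adding, deleting, or
            # reversing one edge while keeping the graph acyclic.
            candidates = list(neighbors(current))
            if not candidates:
                break
            nxt = max(candidates, key=lambda s: score(s, data))
            nxt_score = score(nxt, data)
            if nxt_score <= current_score:
                break                          # local optimum reached
            current, current_score = nxt, nxt_score
        if current_score > best_score:
            best, best_score = current, current_score
    return best
```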
58
Statistical independence testing
The G² statistic is used to test for the independence of A and B:
G² = 2 · Σ_{a,b} s_ab · ln( (s_ab · M) / (s_a · s_b) )
where
s_a = the number of times the expression level of A = a,
s_b = the number of times the expression level of B = b,
s_ab = the number of times the expression levels of A = a and B = b simultaneously,
M = total number of data cases.
G² has the chi-square distribution with (r_A − 1)(r_B − 1) degrees of freedom, where r_A and r_B are the numbers of expression levels of A and B.
[Richard E. Neapolitan, "Learning Bayesian Networks", 2004]
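A small sketch computing G² and its p-value for a contingency table of counts (the table values here are placeholders, not the slide's data):

```python
import numpy as np
from scipy.stats import chi2

counts = np.array([[10.0, 20.0],         # rows: levels of A, columns: levels of B
                   [30.0, 25.0]])        # placeholder counts
M = counts.sum()
# Expected counts under independence: s_a * s_b / M
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / M

mask = counts > 0                         # skip empty cells (0 * ln 0 -> 0)
g2 = 2.0 * np.sum(counts[mask] * np.log(counts[mask] / expected[mask]))

dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
p_value = chi2.sf(g2, dof)
print(f"G^2 = {g2:.3f}, dof = {dof}, p = {p_value:.3f}")
# A large p-value means we cannot reject independence of A and B.
```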
59
Example (statistical independence testing)
Suppose G1 and G2 each have expression levels {+, −}.
[Table: eight observed cases of (G1, G2) and the resulting 2×2 contingency table of counts.]
We cannot reject the hypothesis that G1 and G2 are independent.