Intro. to Data Mining Chapter 6. Bayesian.

What Is Classification? A prediction model is learned from labeled training instances (Positive / Negative), and the learned model is then applied to classify test instances.

Typical Classification Methods: decision trees (e.g., splits on age, student, credit rating), support vector machines, Bayesian networks (e.g., the lung-cancer network over Family History, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea), neural networks, and many more.

Pattern-Based Classification, Why? Pattern-based classification integrates frequent pattern mining with classification. Why pattern-based classification? (1) Feature construction: higher-order, compact, discriminative features, e.g., single word → phrase (Apple pie, Apple i-pad); a single feature is often not enough. (2) Complex data modeling: graphs (no predefined feature vectors), sequences, and semi-structured/unstructured data; complex data is difficult to handle directly.

Pattern-Based Classification on Graphs: mine frequent subgraphs (e.g., with min_sup = 2) from the labeled graphs (Active / Inactive), transform each graph into a feature vector recording which frequent subgraphs (g1, g2, ...) it contains, and then train any classifier on those features (a toy sketch follows below). Related work (e.g., emerging patterns) is not confined to rule-based classifiers and typically selects the most discriminative features.
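
As a toy illustration of the mine-and-transform step, the sketch below represents graphs and subgraphs simply as sets of labeled edges, so "contains a subgraph" is just a subset test; the example graphs, the patterns g1 and g2, and the min_sup = 2 result are assumptions for illustration (a real system would run a frequent-subgraph miner such as gSpan and use subgraph isomorphism).

# Toy encoding: a graph is a frozenset of labeled edges.
g_active_1 = frozenset({("C", "N"), ("N", "O"), ("C", "C")})
g_active_2 = frozenset({("C", "N"), ("C", "C")})
g_inactive = frozenset({("C", "O")})

# Frequent subgraphs assumed to have been mined with min_sup = 2
g1 = frozenset({("C", "N")})
g2 = frozenset({("C", "C")})
patterns = [g1, g2]

def to_feature_vector(graph, patterns):
    # 1 if the graph contains the pattern (subset test in this toy encoding), else 0
    return [1 if p <= graph else 0 for p in patterns]

dataset = [(g_active_1, "Active"), (g_active_2, "Active"), (g_inactive, "Inactive")]
for graph, label in dataset:
    print(to_feature_vector(graph, patterns), label)
# The resulting 0/1 feature vectors can be fed to any classifier.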

Discrete Random Variables: a discrete random variable X has a finite set of possible outcomes; for example, a binary X takes one of two values (e.g., 0 or 1).

Continuous Random Variables: a continuous random variable is described by a probability distribution (density function) over continuous values; probabilities correspond to areas under the density, e.g., P(5 <= X <= 7).

Conditional probability

Mutually exclusive / independence

Joint / marginal probability
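
For reference, the standard definitions behind these three slides can be written compactly as:

Conditional probability: P(A | B) = P(A, B) / P(B), provided P(B) > 0.
Mutually exclusive events: P(A, B) = 0, so P(A or B) = P(A) + P(B). Independence: P(A, B) = P(A) P(B), equivalently P(A | B) = P(A).
Marginal probability (discrete case): P(A = a) = Σ_b P(A = a, B = b), summing the joint distribution over the other variable.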

Example

Bayes Rule: the prior probability of each category is its probability given no information about the item. Categorization produces a posterior probability distribution over the possible categories, given a description of the item.
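
In symbols, with C a category and X the description of an item:

P(C | X) = P(X | C) P(C) / P(X)

Here P(C) is the prior and P(C | X) is the posterior; since P(X) is the same for every category, classification picks the category that maximizes P(X | C) P(C).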

Naïve Bayes Classifier: Training Dataset
Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naïve Bayes Classifier: Another Calculation Example
P(Ci): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.600
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.400
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.200
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.400
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci): P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no") = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
P(X|Ci) * P(Ci): P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
Therefore, X belongs to class "buys_computer = yes".
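
A minimal Python sketch that reproduces the computation above; the conditional probabilities are copied directly from the slide rather than re-estimated from the raw table:

priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

for c in ("yes", "no"):
    likelihood = 1.0
    for feature in x:
        likelihood *= cond[c][feature]   # naive independence assumption
    print(c, round(likelihood, 3), round(likelihood * priors[c], 3))
# yes: 0.044 and 0.028;  no: 0.019 and 0.007  ->  predict buys_computer = yes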

Naive Bayes

Naïve Bayes example

Different types of variables

Discrete variables

Continuous variables

Continuous variables example
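
One common way to handle a continuous attribute in naive Bayes is to model its class-conditional density as a Gaussian estimated from the training tuples of each class. A minimal sketch (the "age" values below are made up for illustration):

import math

def fit_gaussian(values):
    # estimate mean and standard deviation of an attribute within one class
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var)

def gaussian_density(x, mean, std):
    # class-conditional density value used in place of P(x | C)
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

ages_yes = [25, 32, 38, 40, 35, 30, 29, 33, 36]   # hypothetical ages of the "yes" class
mean, std = fit_gaussian(ages_yes)
print(gaussian_density(31, mean, std))             # density for age = 31 given class "yes"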

Bayes example

Bayes classifier example

Bayes classifier with several features
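
Under the naive assumption of class-conditional independence, the likelihood of an instance with several features factorizes, and the classifier picks the class with the largest product of prior and per-feature likelihoods:

P(X | Ci) = P(x_1 | Ci) P(x_2 | Ci) ... P(x_n | Ci)
predicted class = argmax over Ci of P(Ci) * Π_k P(x_k | Ci)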

Language Model
How to compute the joint probability P(its, water, is, so, transparent, that)?
Recall the definition of conditional probability: P(B|A) = P(A,B) / P(A)
Rewriting: P(A,B) = P(A) P(B|A)
More variables: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
P("its water is so transparent") = P(its) x P(water | its) x P(is | its water) x P(so | its water is) x P(transparent | its water is so)

N-gram Models
Example corpus:
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
In general this is an insufficient model of language, because language has long-distance dependencies, but...
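
A small sketch that estimates bigram probabilities by maximum likelihood from the three-sentence corpus above (counts only, no smoothing):

from collections import Counter

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    # maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<s>"))   # 2/3
print(p("am", "I"))    # 2/3
print(p("Sam", "am"))  # 1/2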

Naïve Bayes Example

Discussion of Bayes'

Example of Bayes'

Laplace estimator
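
A sketch of the Laplace (add-one) estimator for a conditional probability in naive Bayes; it prevents a single zero count from zeroing out the whole product of likelihoods:

def laplace_estimate(count_xc, count_c, num_values):
    # (occurrences of value x within class c + 1) / (size of class c + number of possible values of the attribute)
    return (count_xc + 1) / (count_c + num_values)

# Hypothetical counts: a value that never occurs in a class of 9 tuples, attribute with 3 possible values
print(laplace_estimate(0, 9, 3))   # 1/12 instead of 0
print(laplace_estimate(4, 9, 3))   # 5/12 instead of 4/9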

M-estimate

M-estimator example
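
The m-estimate generalizes this idea; with n_c the count of the value within the class, n the class size, p a prior estimate of the probability, and m the equivalent sample size:

P(x | c) = (n_c + m * p) / (n + m)

Choosing p = 1/k and m = k, where k is the number of possible values of the attribute, recovers the Laplace estimator above.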

Naïve Bayes Classifier: Comments
Advantages: easy to implement; good results obtained in most cases.
Disadvantages: it assumes class-conditional independence, which costs accuracy because, in practice, dependencies exist among variables. E.g., in hospital data a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) are interdependent; such dependencies cannot be modeled by a naïve Bayes classifier.
How to deal with these dependencies? Bayesian belief networks.

Discussion of Bayes'

Bayesian Belief Networks
Bayesian belief networks (also known as Bayesian networks or probabilistic networks) allow class-conditional independencies between subsets of variables.
A (directed acyclic) graphical model of causal relationships: it represents dependencies among the variables and gives a specification of the joint probability distribution.
Nodes: random variables. Links: dependencies. Example: X and Y are the parents of Z, and Y is the parent of P; there is no direct dependency between Z and P. The graph has no loops/cycles.
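
The joint distribution represented by a Bayesian network factorizes into one conditional term per node given its parents; for the four-variable example just described:

P(x_1, ..., x_n) = Π_i P(x_i | Parents(X_i))
P(X, Y, Z, P) = P(X) P(Y) P(Z | X, Y) P(P | Y)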

Bayesian Belief Networks

Examples of 3-way Bayesian Networks
Marginal independence (no edges): p(A,B,C) = p(A) p(B) p(C).
Conditionally independent effects (A → B, A → C): p(A,B,C) = p(B|A) p(C|A) p(A); B and C are conditionally independent given A. E.g., A is a disease, and we model B and C as conditionally independent symptoms given A.

Examples of 3-way Bayesian Networks C Independent Causes: p(A,B,C) = p(C|A,B)p(A)p(B) “Explaining away” effect: Given C, observing A makes B less likely e.g., earthquake/burglary/alarm example A and B are (marginally) independent but become dependent once C is known A C B Markov dependence: p(A,B,C) = p(C|B) p(B|A)p(A)

Bayesian Belief Networks

Bayesian Belief Networks example

Discussion of Bayesian Belief Networks

Conditional Independence
A variable (node) is conditionally independent of its non-descendants given its parents. Example: Cancer has parents Exposure to Toxics and Smoking, non-descendants Age and Gender, and descendants Serum Calcium and Lung Tumor; thus Cancer is independent of Age and Gender given Exposure to Toxics and Smoking.

The learning task
Input: training data (data cases over the variables Burglary, Earthquake, Alarm, Newscast, Call), which may be fully or partially observable.
Output: a Bayesian network modeling the data. Do we learn only the parameters, or also the structure?

Structure learning Goal: find “good” BN structure (relative to data) Solution: do heuristic search over space of network structures.

Search space Space = network structures Operators = add/reverse/delete edges

Heuristic search: use a scoring function to guide heuristic search (any search algorithm can be used); greedy hill-climbing with some randomness works pretty well.
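
A schematic sketch of greedy hill-climbing with a little injected randomness over network structures; the scoring function and the neighbor generator (add/reverse/delete one edge, keeping the graph acyclic) are passed in as placeholders, since a real implementation would plug in a score such as BIC or BDeu:

import random

def hill_climb(initial, neighbors, score, restarts=5, steps=100):
    # repeat the search several times and keep the best-scoring structure seen
    best, best_score = initial, score(initial)
    for _ in range(restarts):
        current, current_score = initial, score(initial)
        for _ in range(steps):
            candidates = neighbors(current)
            if not candidates:
                break
            nxt = max(candidates, key=score)
            if score(nxt) <= current_score and random.random() > 0.1:
                break                            # local optimum: usually stop here
            if score(nxt) <= current_score:
                nxt = random.choice(candidates)  # injected randomness: sideways/worse move
            current, current_score = nxt, score(nxt)
        if current_score > best_score:
            best, best_score = current, current_score
    return best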

Statistical Independence Testing
The G² statistic is used to test for the independence of A and B:
G² = 2 * Σ_{a,b} s_ab * ln( (s_ab * M) / (s_a * s_b) )
where s_a = the number of times the expression level of A = a, s_b = the number of times the expression level of B = b, s_ab = the number of times the expression levels of A = a and B = b simultaneously, and M = the total number of data cases.
G² has (approximately) a chi-square distribution with (r_A - 1)(r_B - 1) degrees of freedom, where r_A and r_B are the numbers of expression levels of A and B.
[Richard E. Neapolitan, Learning Bayesian Networks, 2004]

Example (Statistical Independence Testing)
Suppose G1 and G2 each have expression levels {+, -}, and we observe 8 data cases. Tabulating the observed counts of each (G1, G2) combination in a 2 x 2 contingency table and computing G² gives a small value, so we cannot reject the hypothesis that G1 and G2 are independent.
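
A sketch of the G² computation on a 2 x 2 contingency table; the counts below are hypothetical, only to show the mechanics (compare the statistic against a chi-square critical value with (r_A - 1)(r_B - 1) degrees of freedom, here 1, which is 3.84 at the 0.05 level):

import math

def g_squared(table):
    # G^2 = 2 * sum over cells of s_ab * ln( s_ab * M / (s_a * s_b) )
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    m = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, s_ab in enumerate(row):
            if s_ab > 0:
                g2 += s_ab * math.log(s_ab * m / (row_totals[i] * col_totals[j]))
    return 2 * g2

table = [[1, 2],   # rows: G1 = +, G1 = -; columns: G2 = +, G2 = - (hypothetical counts)
         [1, 4]]
print(g_squared(table))   # about 0.17, well below 3.84, so independence is not rejected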