Presentation transcript:

1 Bayesian Methods

2 Naïve Bayes
New data point to classify: X = (x_1, x_2, …, x_m)
Strategy:
– Calculate P(C_i | X) for each class C_i.
– Select the C_i for which P(C_i | X) is maximum.
P(C_i | X) = P(X | C_i) P(C_i) / P(X) ∝ P(X | C_i) P(C_i) ∝ P(x_1 | C_i) P(x_2 | C_i) … P(x_m | C_i) P(C_i)
Naïvely assumes that each x_i is independent.
We write P(X) for P(X | C_i), etc., when unambiguous.
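A minimal runnable sketch of this strategy in Python (illustrative only; the data layout and the add-one smoothing are assumptions of mine, not part of the slides):

from collections import Counter, defaultdict
import math

def train_naive_bayes(records, labels):
    # Count class frequencies and per-class (attribute index, value) frequencies.
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)   # class -> Counter over (attribute index, value)
    for x, c in zip(records, labels):
        for j, v in enumerate(x):
            feat_counts[c][(j, v)] += 1
    return class_counts, feat_counts

def classify(x, class_counts, feat_counts):
    # Select C_i maximizing P(C_i) * prod_j P(x_j | C_i); log-space for numerical stability.
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(n_c / n)
        for j, v in enumerate(x):
            # add-one smoothing; the +2 in the denominator assumes binary attribute values
            score += math.log((feat_counts[c][(j, v)] + 1) / (n_c + 2))
        if score > best_score:
            best, best_score = c, score
    return best

With add-one smoothing, a value never seen with a class still gets a small non-zero probability instead of zeroing out the whole product.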

3 Bayesian Belief Networks
Naïve Bayes assumes independence between attributes – not always correct!
If we don't assume independence, the problem becomes exponential – every attribute can be dependent on every other attribute.
Luckily, in real life most attributes don't depend (directly) on other attributes.
A Bayesian network explicitly encodes dependencies between attributes.

4 Bayesian Belief Network
[Figure: a DAG with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay and Dyspnea, together with the conditional probability table for LungCancer, whose columns are (FH,S), (FH,!S), (!FH,S), (!FH,!S) and whose rows are LC, !LC.]
P(X) = P(x_1 | Parents(x_1)) P(x_2 | Parents(x_2)) … P(x_m | Parents(x_m))
e.g. P(PositiveXRay, Dyspnea)
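A hedged sketch of how this factorization is evaluated; the node names follow the figure, but the CPT entries below are invented placeholders (the slide's numbers did not survive extraction) and only part of the network is shown:

def joint_probability(assignment, network):
    # network: node -> (list of parents, {tuple of parent values: P(node=True | parents)})
    # assignment: node -> True/False for every node in the network
    p = 1.0
    for node, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[pa] for pa in parents)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

# Partial network with placeholder probabilities:
network = {
    "FamilyHistory": ([], {(): 0.1}),
    "Smoker":        ([], {(): 0.3}),
    "LungCancer":    (["FamilyHistory", "Smoker"],
                      {(True, True): 0.8, (True, False): 0.5,
                       (False, True): 0.7, (False, False): 0.1}),
}
print(joint_probability({"FamilyHistory": True, "Smoker": True, "LungCancer": True}, network))

Quantities such as P(PositiveXRay, Dyspnea) are then obtained by summing this joint probability over the unobserved nodes.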

5 Maximum Entropy Approach
Think e-mails, keywords, spam / non-spam.
Given a new data point X = {x_1, x_2, …, x_m} to classify, calculate P(C_i | X) for each class C_i. Select the C_i for which P(C_i | X) is maximum.
P(C_i | X) = P(X | C_i) P(C_i) / P(X) ∝ P(X | C_i) P(C_i)
Naïve Bayes assumes that each x_i is independent.
Instead, estimate P(X | C_i) directly from the training data: support_{C_i}(X).
Problem: there may be no instance of X in the training data.
– Training data is usually sparse.
Solution: estimate P(X | C_i) from the available features in the training data: P(Y_j | C_i) might be known for several Y_j.

6 Background: Shannon's Entropy
An experiment has several possible outcomes.
In N experiments, suppose each outcome occurs M times.
This means there are N/M possible outcomes.
To represent each outcome, we need log(N/M) bits.
– This generalizes even when all outcomes are not equally frequent.
– Reason: for an outcome j that occurs M times, there are N/M equi-probable events among which only one corresponds to j.
Since p_i = M/N, the information content of an outcome is -log p_i.
So, expected information content: H = -Σ_i p_i log p_i
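A small helper to make the formula concrete (my own illustration, not from the slides):

import math

def entropy(probs):
    # H = -sum_i p_i * log2(p_i), with 0 log 0 treated as 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: two equally likely outcomes
print(entropy([0.9, 0.1]))   # ~0.47 bits: a more predictable source needs fewer bits on average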

7 Maximum Entropy Principle
Entropy corresponds to the disorder in a system.
– Intuition: a highly ordered system will require fewer bits to represent it.
If we do not have evidence for any particular order in a system, we should assume that no such order exists.
The order that we do know of can be represented in the form of constraints.
Hence, we should maximize the entropy of the system subject to the known constraints.
If the constraints are consistent, there is a unique solution that maximizes entropy.

8 Max Ent in Classification
Among the distributions P(X | C_i), choose the one that has maximum entropy.
Use the selected distribution to classify according to the Bayesian approach.

9 Association Rule Based Methods

10 CPAR, CMAR, etc.
Separate the training data by class.
Find frequent itemsets in each class.
– Class Association Rules: LHS = frequent itemset, RHS = class label.
To classify a record R, find all association rules of each class that apply to R.
Combine the evidence of the rules to decide which class R belongs to.
– E.g. add the probabilities of the best k rules.
– Mathematically incorrect, but works well in practice.
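A hedged sketch of the "best k rules" combination step only (rule mining is assumed to have been done already; the rule format and the choice of k are mine):

def score_class(record, rules, k=3):
    # rules: list of (lhs itemset, rule probability/accuracy) for one class; record is a set of items
    matching = sorted((acc for lhs, acc in rules if lhs <= record), reverse=True)
    return sum(matching[:k])          # add the probabilities of the best k applicable rules

def classify_by_rules(record, rules_by_class, k=3):
    # pick the class whose best-k matching rules give the strongest combined evidence
    return max(rules_by_class, key=lambda c: score_class(record, rules_by_class[c], k))

Variants such as CPAR average the best k rule accuracies rather than summing them; either way the combination is heuristic, as the slide notes.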

11 Max Ent + Frequent Itemsets: ACME

12 ACME
The frequent itemsets of each class, with their probabilities, are used as constraints in a max-entropy model.
– Evidence is combined using max-ent.
– Mathematically robust.
In practice, frequent itemsets represent all the significant constraints of a class.
– Best in theory and in practice.
But slow.

13 Preliminaries
Record     Class
a, b, c    C1
b, c       C1
a, d       C2
a, c       C1
I = {a, b, c, d}   (features)
C = {C1, C2}       (classes)
Query: a, b, d → ?
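For the sketches that follow, this toy data can be written out as (a hypothetical Python encoding, names my own):

records = [
    ({"a", "b", "c"}, "C1"),
    ({"b", "c"},      "C1"),
    ({"a", "d"},      "C2"),
    ({"a", "c"},      "C1"),
]
I = {"a", "b", "c", "d"}    # features
C = {"C1", "C2"}            # classes
query = {"a", "b", "d"}     # record to classify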

14 Frequent Itemsets
Record     Class
a, b, c    C1
b, c       C1
a, d       C2
a, c       C1
An itemset whose frequency is at least a minimum support is called a frequent itemset.
Frequent itemsets are mined using the Apriori algorithm.
Ex: if the minimum support is 2, then {b, c} is a frequent itemset.
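A minimal brute-force sketch of frequent-itemset counting (Apriori's candidate pruning is omitted for brevity; this is an illustration, not the authors' implementation):

from itertools import combinations

def frequent_itemsets(transactions, min_support):
    # transactions: list of item sets; returns {frozenset(itemset): support count}
    items = sorted(set().union(*transactions))
    freq = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            cand = frozenset(cand)
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_support:
                freq[cand] = count
    return freq

# frequent_itemsets([{"a","b","c"}, {"b","c"}, {"a","d"}, {"a","c"}], 2)
# includes frozenset({"b", "c"}) with support 2, as in the slide's example.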

15 Split Data by Classes
[Diagram: the records are split by class: S1 holds C1's records {a,b,c}, {b,c}, {a,c} and S2 holds C2's record {a,d}; Apriori is run on each part, giving the frequent itemsets of C1 and the frequent itemsets of C2.]

16 Build Constraints for a Class
C1's records: {a, b, c}, {b, c}, {a, c}
Constraints of C1 (itemset s_j, probability p_j):
b, c       0.67
a, b, c    0.33
b          0.67
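A hedged sketch of this step, reusing the frequent_itemsets helper above; the constraint probability is just the itemset's relative frequency within the class:

def build_constraints(class_records, min_support):
    # returns a list of (itemset Y_j, target probability p_j) for one class
    n = len(class_records)
    freq = frequent_itemsets(class_records, min_support)
    return [(itemset, count / n) for itemset, count in freq.items()]

# With C1's records and min_support = 1 this includes {b,c} -> 0.67, {a,b,c} -> 0.33
# and {b} -> 0.67, i.e. the constraints shown on the slide (among other itemsets).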

17 Build the Distribution of Class C1
[Table: all 2^4 = 16 possible records X over the items a, b, c, d, with the column P(X | C1) to be determined.]
Constraints (s_i, p_i): {b, c} 0.67, {a, b, c} 0.33, {b} 0.67
Maximum Entropy Principle: build a distribution P(X | C1) that conforms to the constraints and has the highest entropy.

18 Log-Linear Modeling
The maximum-entropy distribution takes a log-linear form, P(X) = μ_0 ∏_{j: X satisfies Y_j} μ_j, with one parameter μ_j per constraint and μ_0 a normalizer (see the GIS pseudocode on the next slide).
These μ's can be computed by an iterative fitting algorithm like the GIS algorithm.

19 Generalized Iterative Scaling Algorithm
# N items, M constraints (Y_j with target probability d_j)
P(X_k) = 1 / 2^N        # for k = (1…2^N); uniform distribution
μ_j = 1                 # for j = (1…M)
while all constraints not satisfied:
    for each constraint C_j:
        S_j = Σ_{k: T_k satisfies Y_j} P(X_k)
        μ_j *= d_j / S_j
    P(X_k) = μ_0 ∏_{j satisfied by T_k} μ_j
# μ_0 is to ensure that Σ_k P(X_k) = 1
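A runnable Python transcription of this loop (toy scale only, since it enumerates all 2^N records; the names and the fixed iteration count are my own choices):

from itertools import product

def gis(n_items, constraints, iters=200):
    # constraints: list of (itemset Y_j as a set of item indices, target probability d_j)
    space = [frozenset(i for i in range(n_items) if bits[i])
             for bits in product([0, 1], repeat=n_items)]       # all 2^N possible records T_k
    mu = [1.0] * len(constraints)
    p = {t: 1.0 / len(space) for t in space}                    # uniform starting distribution
    for _ in range(iters):                                      # the slide loops until constraints hold
        for j, (y_j, d_j) in enumerate(constraints):
            s_j = sum(p[t] for t in space if y_j <= t)          # current model probability of Y_j
            if s_j > 0:
                mu[j] *= d_j / s_j                              # rescale that constraint's parameter
        raw = {t: 1.0 for t in space}                           # rebuild P from the log-linear form
        for j, (y_j, _) in enumerate(constraints):
            for t in space:
                if y_j <= t:
                    raw[t] *= mu[j]
        z = sum(raw.values())                                   # mu_0 = 1/z normalizes the distribution
        p = {t: v / z for t, v in raw.items()}
    return p

# Items a=0, b=1, c=2, d=3 with C1's constraints from slide 16:
# p = gis(4, [({1, 2}, 0.67), ({0, 1, 2}, 0.33), ({1}, 0.67)])
# Because {b} and {b, c} both have target 0.67, records with b but not c are driven
# towards probability 0, which is exactly the problem discussed on slide 20.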

20 Problem with the Log-Linear Model
Constraints (s_i, p_i): {b, c} 0.67, {b} 0.67
A solution does not exist if P(X | C_j) = 0 for any X.
Here the probability is 0 for every X which has b=1 but c=0.

21 Fix to the Model
[Table: the 2^4 records X over a, b, c, d with P(X | C1); the rows with b=1 and c=0 are set to 0.]
Fix: define the model only on those X whose probability is non-zero.
Explicitly set these record probabilities to zero and learn the μ's without considering them.
Learning time decreases as |X| decreases.

22 Effect of Pruning
[Table: number of constraints and percentage of X pruned per dataset: Austra (354), Waveform (99), Cleve (246), Diabetes (85), German (54), Heart (115), Breast (189), Lymph (29), Pima (87); the percentage values did not survive extraction.]
Datasets chosen from the UCI ML Repository.

23 Making the approach scalable (1)
Remove non-informative constraints.
– A constraint is informative if it can distinguish between classes very well.
Use the standard information measure.
Ex: s_1 = {a, b, c}, P(C1 | s_1) = 0.45 and P(C2 | s_1) = 0.55 → remove {a, b, c} from the constraint set.
s_2 = {b, c}, P(C1 | s_2) = 0.8 and P(C2 | s_2) = 0.2 → include {b, c} in the constraint set.
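One way this filter might be coded, using the entropy of the class distribution given the itemset as the information measure (the cut-off value below is an illustrative assumption):

import math

def is_informative(class_probs, threshold=0.9):
    # class_probs: [P(C1 | s), P(C2 | s), ...]; low entropy = the itemset discriminates well
    h = -sum(p * math.log2(p) for p in class_probs if p > 0)
    return h < threshold

print(is_informative([0.45, 0.55]))   # False (~0.99 bits) -> remove {a, b, c}
print(is_informative([0.8, 0.2]))     # True  (~0.72 bits) -> keep {b, c}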

24 Making the approach scalable (2)
Splitting: split the set of features I into groups that are independent of each other.
– Two groups of features are independent of each other if they don't have an overlapping constraint between them.
The global P(.) can be calculated by merging the individual P(.)'s of each group in a naïve-Bayes fashion.
Ex: I = {a, b, c, d}, and the constraints are {a}, {a, b} and {c, d}.
Split I into I_1 = {a, b} and I_2 = {c, d}.
Learn log-linear models P_1(.) for I_1 = {a, b} and P_2(.) for I_2 = {c, d}.
P(b, c) = P_1(b) * P_2(c)
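A hedged sketch of the grouping step: features fall in the same group whenever some constraint links them, i.e. groups are the connected components over the constraint itemsets (the union-find below is my own choice of mechanism):

def split_into_groups(features, constraint_itemsets):
    # Union features that co-occur in any constraint; each resulting set is an independent group.
    parent = {f: f for f in features}
    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]   # path compression
            f = parent[f]
        return f
    for itemset in constraint_itemsets:
        first, *rest = list(itemset)
        for other in rest:
            parent[find(other)] = find(first)
    groups = {}
    for f in features:
        groups.setdefault(find(f), set()).add(f)
    return list(groups.values())

print(split_into_groups({"a", "b", "c", "d"}, [{"a"}, {"a", "b"}, {"c", "d"}]))
# -> [{'a', 'b'}, {'c', 'd'}]   (order may vary)
# A separate log-linear model is then learned per group and merged as on the slide: P(b, c) = P1(b) * P2(c)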