COMP527: Data Mining. Classification: Bayes. M. Sulaiman Khan, Dept. of Computer Science, University of Liverpool. February 17, 2009.


Slide 1: COMP527: Data Mining. Classification: Bayes. M. Sulaiman Khan, Dept. of Computer Science, University of Liverpool. February 17, 2009.

Slide 2: Course outline. Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam.

Slide 3: Today's Topics. Statistical Modeling; Bayes Rule; Naïve Bayes; Fixes to Naïve Bayes; Document Classification; Bayesian Networks; Structure Learning.

Slide 4: Bayes Rule. The probability of hypothesis H, given evidence E: Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]. Pr[H] is the a priori probability of H (before the evidence is seen); Pr[H|E] is the a posteriori probability of H (after the evidence is seen). We want to use this in a classification system, so our goal is to find the most probable hypothesis (class) given the evidence (the test instance).

Slide 5: Example. Meningitis causes a stiff neck 50% of the time. Meningitis occurs in 1 in 50,000 patients; stiff necks occur in 1 in 20. The probability of meningitis, given that the patient has a stiff neck: Pr[M|SN] = Pr[SN|M] * Pr[M] / Pr[SN] = 0.5 * (1/50,000) / (1/20) = 0.0002.

Slide 6: Bayes Rule. Our evidence E is made up of different attributes A1..An, so: Pr[H|E] = Pr[A1|H] * Pr[A2|H] * ... * Pr[An|H] * Pr[H] / Pr[E]. So we need to work out the probability of each individual attribute value per class. Easy: Outlook=Sunny appears twice for play=yes out of 9 yes instances, giving 2/9. We can work these out for all of our training instances.

Slide 7: Weather Probabilities. Given a test instance (sunny, cool, high, true): play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053; play=no: 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206. So we'd predict play=no for that particular instance.

Slide 8: Weather Probabilities. play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053; play=no: 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206. These are likelihoods, not probabilities, so we need to normalise them: Prob(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%. This is where the Pr[E] denominator disappears from Bayes's rule. Nice. Surely there's more to it than this...?
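A minimal sketch of this calculation in Python (the fractions are the weather-data counts quoted above; the variable names are just illustrative):

    # Naive Bayes likelihoods for the weather test instance (sunny, cool, high, true).
    cond = {
        "yes": [2/9, 3/9, 3/9, 3/9],   # sunny, cool, high humidity, windy=true given play=yes
        "no":  [3/5, 1/5, 4/5, 3/5],   # the same attribute values given play=no
    }
    prior = {"yes": 9/14, "no": 5/14}

    likelihood = {}
    for cls, probs in cond.items():
        l = prior[cls]
        for p in probs:
            l *= p
        likelihood[cls] = l

    total = sum(likelihood.values())
    for cls, l in likelihood.items():
        # prints the likelihood and the normalised probability, e.g. "yes 0.0053 0.205"
        print(cls, round(l, 4), round(l / total, 3))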

Slide 9: Naïve Bayes. Issue: it is only valid to multiply probabilities when the events are independent of each other. It is "naïve" to assume independence between attributes in datasets, hence the name. E.g. the probability of Liverpool winning a football match is not independent of the probabilities of each member of the team scoring a goal. But even given that, Naïve Bayes is still very effective in practice, especially if we can eliminate redundant attributes before processing.

Slide 10: Naïve Bayes. Issue: if an attribute value never co-occurs with a class value, then the probability generated for it will be 0. E.g. given outlook=overcast, the probability of play=no is 0/5. The other attributes will then be ignored, as the final result is multiplied by 0. This is bad for our 4-attribute set, but horrific for (say) a 1000-attribute set. You can easily imagine a case where the likelihood for all classes is 0. E.g. 'Viagra' is always spam, 'data mining' is never spam. An email containing both will score 0 for spam=yes and 0 for spam=no... the probability will be undefined... uh oh!

Slide 11: Laplace Estimator. The trivial solution is, of course, to adjust the probabilities so that we never get 0s. For an attribute with three values we add 1 to each numerator and 3 to the denominator to compensate, so we end up with 1/8 instead of 0/5. There is no reason to use 3: we could use 2 and 6. Nor is there any reason to split equally; we could give some values a larger share of the weight, e.g. (a+3)/(n+6) * (b+2)/(n+6) * (c+1)/(n+6). However, how to assign these weights is unclear. For reasonable training sets, simply initialise counts to 1 rather than 0.
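A minimal sketch of this smoothing, assuming an equal weight of 1 per attribute value (the counts and names below are just illustrative):

    # Laplace-smoothed conditional probabilities for one attribute within one class.
    # counts[v] = number of training instances of this class that have attribute value v.
    def smoothed_probs(counts, weight=1.0):
        total = sum(counts.values()) + weight * len(counts)
        return {v: (c + weight) / total for v, c in counts.items()}

    # outlook counts for play=no in the weather data: sunny=3, overcast=0, rainy=2
    print(smoothed_probs({"sunny": 3, "overcast": 0, "rainy": 2}))
    # overcast now gets 1/8 instead of 0/5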

Slide 12: Missing Values. Naïve Bayes deals well with missing values. Training: ignore the instance for that attribute/class combination, but still use it for the attributes that are known. Classification: leave the attribute out of the calculation; the difference will be normalised away during the final step anyway.

Slide 13: Numeric Values. Naïve Bayes does not deal well with numeric values without some help: the probability of the temperature being exactly 65 degrees is zero. We could discretize the attribute, but instead we'll calculate the mean and standard deviation and use a density function to estimate the probability. mean: sum(values) / count(values); variance: sum(square(value - mean)) / (count(values) - 1); standard deviation: square root of the variance. The mean for temperature (given play=yes) is 73 and the standard deviation is 6.2.

Slide 14: Numeric Values. Density function: f(x) = (1 / (sqrt(2π) * σ)) * e^(-(x - μ)^2 / (2σ^2)). Unless you've a maths background, just plug the numbers in; for example, with mean 73 and standard deviation 6.2, a temperature of 66 gives a likelihood of about 0.034. Then we continue with this number as before. This assumes a reasonably normal distribution, which is often not the case.
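A minimal sketch of this density calculation (the mean and standard deviation are the ones quoted above; the temperature of 66 is just an illustrative test value):

    import math

    def gaussian_density(x, mean, std):
        # normal probability density, used by Naive Bayes for numeric attributes
        return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

    # temperature given play=yes: mean 73, standard deviation 6.2
    print(round(gaussian_density(66, 73, 6.2), 4))   # ~0.034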

Slide 15: Document Classification. The Bayesian model is often used to classify documents, as it deals well with a huge number of attributes simultaneously (e.g. the boolean occurrence of words within the text). But we may also know how many times each word occurs, which leads to Multinomial Naïve Bayes. Assumptions: 1. The probability of a word occurring in a document is independent of its location within the document. 2. The document length is not related to the class.

Slide 16: Document Classification. Pr[E|H] = N! * product over words i of (p_i^n_i / n_i!), where N is the number of words in the document, p_i is the relative frequency of word i in documents of class H, and n_i is the number of occurrences of word i in the document. So, if A has 75% and B has 25% relative frequency in class H: Pr["A A A"|H] = 3! * (0.75^3/3!) * (0.25^0/0!) = 27/64 ≈ 0.42; Pr["A A A B B"|H] = 5! * (0.75^3/3!) * (0.25^2/2!) ≈ 0.26.

Slide 17: Document Classification. Pr[E|H] = N! * product(p_i^n_i / n_i!). We don't need to work out all the factorials, as they normalise out at the end. But we still end up with insanely small numbers, since vocabularies are much, much larger than two words. Instead we can sum the logarithms of the probabilities rather than multiplying them.
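A minimal sketch of the multinomial calculation in log space (the two-word vocabulary and its frequencies are the ones from the example above; the function name is illustrative, and the word probabilities are assumed to be already smoothed so none of them is zero):

    import math
    from collections import Counter

    def log_multinomial_likelihood(words, word_probs):
        # sum of n_i * log(p_i); the factorial terms are dropped because they are
        # identical for every class and disappear when the results are normalised
        counts = Counter(words)
        return sum(n * math.log(word_probs[w]) for w, n in counts.items())

    word_probs_h = {"A": 0.75, "B": 0.25}
    print(log_multinomial_likelihood(["A", "A", "A", "B", "B"], word_probs_h))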

Slide 18: Bayesian Networks. Back to the attribute-independence assumption: can we get rid of it? Yes, with a Bayesian Network. Each attribute has a node in a Directed Acyclic Graph, and each node has a table giving the probabilities of its attribute's values conditioned on the values of the nodes with edges pointing at it. Examples will hopefully be enlightening.

Slide 19: Simple Network. [Figure: a network with the Naïve Bayes structure. The class node play (yes/no) is the only parent of the four attribute nodes outlook (sunny/overcast/rainy), temperature (hot/mild/cold), windy (false/true) and humidity (high/normal); each attribute node carries a probability table conditioned on play. The table values did not survive in the transcript.]

Slide 20: Less Simple Network. [Figure: a denser network. play (yes/no) is a parent of all four attribute nodes; in addition outlook is a parent of temperature and windy, and temperature is a parent of humidity, so each node's probability table is conditioned on all of its parents. The table values did not survive in the transcript.]

Slide 21: Bayesian Networks. To use the network, simply step through each node and multiply together the table entries that match the instance's attribute values (or, more likely, sum the logarithms, as in the multinomial case). Then, as before, normalise the class values so they sum to 1. This works because the links between the nodes determine the probability distribution at each node. Using the network is straightforward, so all that remains is to find the best network structure to use. Given a large number of attributes, there's a LARGE number of possible networks...
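A minimal sketch of scoring one instance against such a network, assuming each node's table has already been learned; the dictionary layout and key format here are just illustrative:

    import math

    # network: attribute -> (parent attributes, table); the table maps
    # (class value, parent values..., own value) -> probability.
    def log_score(network, instance, class_value):
        total = 0.0
        for attr, (parents, table) in network.items():
            key = (class_value,) + tuple(instance[p] for p in parents) + (instance[attr],)
            total += math.log(table[key])
        return total

    # adding the log prior for each class and normalising turns the scores into probabilities
    def classify(network, instance, class_prior):
        scores = {c: math.log(p) + log_score(network, instance, c) for c, p in class_prior.items()}
        total = sum(math.exp(s) for s in scores.values())
        return {c: math.exp(s) / total for c, s in scores.items()}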

Slide 22: Training Bayesian Networks. We need two components: 1. A way to evaluate a network against the data. As always we need a measure of 'goodness' that does not overfit (overfitting here = too many edges), so we need a penalty for the complexity of the network. 2. A way to search through the space of possible networks. We already know the nodes, so we need to find where the edges in the graph are: which nodes connect to which other nodes?

Slide 23: Training Bayesian Networks. Following the Minimum Description Length ideal, networks with lots of edges are more complex and hence more likely to overfit, so we could add a penalty for each cell in the nodes' tables: AIC score = -LL + K; MDL score = -LL + (K/2) * log(N). LL is the total log-likelihood of the network on the training set, e.g. the sum of the log probabilities of each instance in the data set. K is the number of cells in the tables, minus the number of cells in the last row of each table (which can be calculated as 1 minus the sum of the other cells in the row). N is the number of instances in the data.
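A minimal sketch of the two scores, assuming the log-likelihood and the parameter count have already been computed (the function names are just illustrative):

    import math

    def aic_score(log_likelihood, num_params):
        # penalty of one unit per free parameter (table cell)
        return -log_likelihood + num_params

    def mdl_score(log_likelihood, num_params, num_instances):
        # penalty of half a log(N) per free parameter
        return -log_likelihood + 0.5 * num_params * math.log(num_instances)

    # lower scores are better: an edge is only kept if it lowers the chosen score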

Slide 24: Network Training: K2. K2: take the nodes in some order; for each node, consider each previously processed node as a candidate parent, add the edge that improves the network's worth the most, and stop adding parents once no addition improves it (use MDL or AIC to determine worth). The result of K2 depends on the initial order chosen for the nodes, so run it several times with different orders and keep the best network. It can help to make the class attribute first and link it to all nodes, though this is not a requirement.
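A minimal greedy sketch of K2, assuming a per-node scoring function (for instance the negated MDL contribution of that node's table); everything here is illustrative, and a real implementation would usually also cap the number of parents per node:

    def k2(nodes, score_node):
        # nodes: attribute names in a chosen processing order
        # score_node(node, parents): evaluates the node's table given a parent set; higher is better
        parents = {n: set() for n in nodes}
        for i, node in enumerate(nodes):
            current = score_node(node, parents[node])
            improved = True
            while improved:
                improved = False
                # only nodes earlier in the ordering may become parents (keeps the graph acyclic)
                candidates = [p for p in nodes[:i] if p not in parents[node]]
                scored = [(score_node(node, parents[node] | {p}), p) for p in candidates]
                if scored:
                    best_score, best_parent = max(scored)
                    if best_score > current:
                        parents[node].add(best_parent)
                        current = best_score
                        improved = True
        return parents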

Slide 25: Other Structures. TAN (Tree Augmented Naïve Bayes): in Naïve Bayes the class attribute is the only parent of each node; start from that structure and consider adding a second parent to each node. Bayesian Multinet: build a separate network for each class and combine the values.

Slide 26: Further Reading. Witten & Frank, sections 4.2 and 6.7. Han, section 6.4. Dunham, section 4.2. Devijver & Kittler, Pattern Recognition: A Statistical Approach, chapter 2. Berry & Browne, chapter 2.