M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Classification: Bayes February 17, 2009.


1 M. Sulaiman Khan (mskhan@liv.ac.uk), Dept. of Computer Science, University of Liverpool, 2009. COMP527: Data Mining, Classification: Bayes, February 17, 2009. Slide 1.

2 Course outline: Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam.

3 Today's Topics: Statistical Modeling; Bayes Rule; Naïve Bayes; Fixes to Naïve Bayes; Document Classification; Bayesian Networks; Structure Learning.

4 Bayes Rule. The probability of hypothesis H, given evidence E: Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E]. Pr[H] is the a priori probability of H (before the evidence is seen); Pr[H|E] is the a posteriori probability of H (after the evidence is seen). We want to use this in a classification system, so our goal is to find the most probable hypothesis (class) given the evidence (test instance).

5 Example. Meningitis causes a stiff neck 50% of the time; meningitis occurs in 1 in 50,000 patients, and stiff necks occur in 1 in 20. The probability of meningitis, given that the patient has a stiff neck: Pr[M|SN] = Pr[SN|M] * Pr[M] / Pr[SN] = 0.5 * (1/50000) / (1/20) = 0.0002.
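The arithmetic on slide 5 can be checked with a few lines of Python, using the figures given above:

```python
# Bayes' rule: Pr[M|SN] = Pr[SN|M] * Pr[M] / Pr[SN]
p_sn_given_m = 0.5     # stiff neck occurs in 50% of meningitis cases
p_m = 1 / 50000        # prior probability of meningitis
p_sn = 1 / 20          # prior probability of a stiff neck

posterior = p_sn_given_m * p_m / p_sn
print(posterior)       # about 0.0002
```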

6 Bayes Rule. Our evidence E is made up of different attributes A1..An, so (assuming the attributes are independent): Pr[H|E] = Pr[A1|H] * Pr[A2|H] * ... * Pr[An|H] * Pr[H] / Pr[E]. So we need to work out the probability of the individual attributes per class. Easy... Outlook=sunny appears twice for play=yes out of 9 yes instances. We can work these out for all of our training instances...

7 Weather Probabilities. Given a test instance (outlook=sunny, temperature=cool, humidity=high, windy=true): play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053; play=no: 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206. So we'd predict play=no for that particular instance.

8 Weather Probabilities. play=yes: 2/9 * 3/9 * 3/9 * 3/9 * 9/14 = 0.0053; play=no: 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206. These are likelihoods, not probabilities; we need to normalise them: Prob(yes) = 0.0053 / (0.0053 + 0.0206) = 20.5%. This is where the Pr[E] denominator disappears from Bayes' rule. Nice. Surely there's more to it than this...?
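A minimal sketch of the calculation above, with the counts taken from the standard weather dataset used in these slides (the test instance is outlook=sunny, temperature=cool, humidity=high, windy=true):

```python
# Per-class likelihoods: product of conditional attribute
# probabilities times the class prior.
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
like_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

# Normalise so the class scores sum to 1; Pr[E] cancels out here.
p_yes = like_yes / (like_yes + like_no)
p_no  = like_no  / (like_yes + like_no)
print(round(p_yes, 3), round(p_no, 3))   # 0.205 0.795
```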

9 Naïve Bayes. Issue: it's only valid to multiply probabilities when the events are independent of each other, and it is "naïve" to assume independence between attributes in datasets, hence the name. E.g. the probability of Liverpool winning a football match is not independent of the probabilities of each member of the team scoring a goal. But even given that, Naïve Bayes is still very effective in practice, especially if we can eliminate redundant attributes before processing.

10 Naïve Bayes. Issue: if an attribute value never co-occurs with a class value, the probability estimated for it will be 0. E.g. given outlook=overcast, the probability of play=no is 0/5. The other attributes will then be ignored, as the final result is multiplied by 0. This is bad for our 4-attribute set, but horrific for (say) a 1000-attribute set. You can easily imagine a case where the likelihood for all classes is 0. E.g. 'Viagra' is always spam, 'data mining' is never spam; an email containing both will score 0 for spam=yes and 0 for spam=no... the probability will be undefined... uh oh!

11 Laplace Estimator. The trivial solution is, of course, to adjust the probabilities so that there are never any 0s: add 1 to each count and 3 (the number of values the attribute can take) to the denominator to compensate, so we end up with 1/8 instead of 0/5. There's no reason to use 1 and 3; we could add 2 to each count and 6 to the denominator. Nor does the added mass have to be split equally: we could weight some attribute values by giving them a larger share, e.g. (a+3)/(n+6) * (b+2)/(n+6) * (c+1)/(n+6). However, how to assign these weights is unclear. For reasonable training sets, simply initialise counts to 1 rather than 0.
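The smoothing step can be sketched as a small helper (the function name and the `alpha` parameter are illustrative, not from the slides):

```python
def laplace_prob(count, total, n_values, alpha=1):
    """Smoothed estimate: add alpha to each count and
    alpha * n_values to the denominator."""
    return (count + alpha) / (total + alpha * n_values)

# outlook=overcast, play=no: the raw estimate would be 0/5.
print(laplace_prob(0, 5, 3))      # 0.125, i.e. 1/8
print(laplace_prob(0, 5, 3, 2))   # ≈ 0.182, i.e. 2/11 with alpha=2
```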

12 Missing Values. Naïve Bayes deals well with missing values. Training: ignore the instance for that attribute/class count, but still use it for the attributes that are known. Classification: omit the attribute from the product; the difference will be normalised away in the final step anyway.

13 Numeric Values. Naïve Bayes does not deal well with numeric values without some help: the probability of the temperature being exactly 65 degrees is zero. We could discretize the attribute, but instead we'll calculate the mean and standard deviation and use a density function to estimate the probability. mean: sum(values) / count(values). variance: sum(square(value - mean)) / (count(values) - 1). standard deviation: square root of variance. The mean for temperature (play=yes) is 73 and the standard deviation is 6.2.
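A quick check of those summary statistics. The nine temperature values below are the play=yes instances of the standard weather dataset (an assumption; the slide only quotes the resulting mean and standard deviation):

```python
import math

# Temperatures of the play=yes instances (assumed from the
# standard weather dataset).
temps = [83, 70, 68, 64, 69, 75, 75, 72, 81]

mean = sum(temps) / len(temps)
# Sample variance: divide by n - 1, as on the slide.
variance = sum((t - mean) ** 2 for t in temps) / (len(temps) - 1)
std_dev = math.sqrt(variance)
print(mean, round(std_dev, 1))   # 73.0 6.2
```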

14 Numeric Values. Density function: f(x) = (1 / (σ * sqrt(2π))) * e^(-(x - μ)² / (2σ²)), where μ is the mean and σ the standard deviation. Unless you've a maths background, just plug the numbers in... at which point we get a likelihood of 0.034. Then we continue with this number as before. This assumes a reasonably normal distribution, which is often not the case.
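The density function in code, a minimal sketch. The test value x = 66 is an assumption (the slide does not state it, but 66 with mean 73 and standard deviation 6.2 reproduces the quoted 0.034):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density: (1/(sigma*sqrt(2*pi))) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return (1 / (sigma * math.sqrt(2 * math.pi))) * \
           math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Mean 73, standard deviation 6.2, as on the slide; x = 66 assumed.
print(round(gaussian_pdf(66, 73, 6.2), 3))   # 0.034
```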

15 Document Classification. The Bayesian model is often used to classify documents, as it deals well with a huge number of attributes simultaneously (e.g. the boolean occurrence of words within the text). But we may also know how many times each word occurs, which leads to Multinomial Naïve Bayes. Assumptions: 1. The probability of a word occurring in a document is independent of its location within the document. 2. The document length is not related to the class.

16 Document Classification. Pr[E|H] = N! * product(p_i^n_i / n_i!), where N is the number of words in the document, p_i is the relative frequency of word i in documents of class H, and n_i is the number of occurrences of word i in the document. So, if word A has 75% and word B has 25% relative frequency in class H: Pr["A A A"|H] = 3! * (0.75^3 / 3!) * (0.25^0 / 0!) = 27/64 = 0.422. Pr["A A A B B"|H] = 5! * (0.75^3 / 3!) * (0.25^2 / 2!) = 0.264.
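The multinomial likelihood formula, sketched directly from the slide and checked against both worked examples:

```python
from math import factorial

def multinomial_likelihood(probs, counts):
    """Pr[E|H] = N! * product(p_i^n_i / n_i!), where N = sum(n_i)."""
    n_total = sum(counts)
    result = factorial(n_total)
    for p, n in zip(probs, counts):
        result *= p ** n / factorial(n)
    return result

# Word probabilities for class H: A = 0.75, B = 0.25.
print(round(multinomial_likelihood([0.75, 0.25], [3, 0]), 3))  # 0.422
print(round(multinomial_likelihood([0.75, 0.25], [3, 2]), 3))  # 0.264
```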

17 Document Classification. Pr[E|H] = N! * product(p_i^n_i / n_i!). We don't need to work out the factorials, as they normalise out at the end. We still end up with insanely small numbers, since vocabularies are much, much larger than two words; instead, we can sum the logarithms of the probabilities rather than multiplying them.
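The log-space version can be sketched as follows; the factorial terms are dropped since they cancel when normalising, so the score is only proportional to the likelihood, but the ranking of classes is unchanged (the two hypothetical classes are assumptions for illustration):

```python
import math

def log_score(probs, counts):
    """Sum of n_i * log(p_i): the multinomial log-likelihood with the
    factorial terms dropped. Zero counts contribute nothing."""
    return sum(n * math.log(p) for p, n in zip(probs, counts) if n > 0)

# Document "A A A B B" scored under two hypothetical classes.
score_h1 = log_score([0.75, 0.25], [3, 2])   # A common in class 1
score_h2 = log_score([0.25, 0.75], [3, 2])   # A rare in class 2
print(score_h1 > score_h2)   # True: class 1 fits the document better
```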

18 Bayesian Networks. Back to the attribute independence assumption: can we get rid of it? Yes, with a Bayesian Network. Each attribute has a node in a Directed Acyclic Graph. Each node has a table giving, for each combination of values of the attributes whose nodes have edges pointing at it, the probabilities of its own values. Examples will hopefully be enlightening...

19 Simple Network

play:           yes .633 | no .367

outlook (given play):
         sunny  overcast  rainy
  yes    .238   .429      .333
  no     .538   .077      .385

temperature (given play):
         hot    mild   cool
  yes    .238   .429   .333
  no     .385   .385   .231

windy (given play):
         false  true
  yes    .350   .650
  no     .583   .417

humidity (given play):
         high   normal
  yes    .350   .650
  no     .750   .250

20 Less Simple Network

play:           yes .633 | no .367

outlook (given play):
                 sunny  overcast  rainy
  yes            .238   .429      .333
  no             .538   .077      .385

temperature (given play, outlook):
                 hot    mild   cool
  yes, sunny     .238   .429   .333
  yes, overcast  .385   .385   .231
  yes, rainy     .111   .556   .333
  no,  sunny     .556   .333   .111
  no,  overcast  .333   .333   .333
  no,  rainy     .143   .429   .429

windy (given play, outlook):
                 false  true
  yes, sunny     .500   .500
  yes, overcast  .500   .500
  yes, rainy     .125   .875
  no,  sunny     .375   .625
  no,  overcast  .500   .500
  no,  rainy     .833   .167

humidity (given play, temperature):
                 high   normal
  yes, hot       .500   .500
  yes, mild      .500   .500
  yes, cool      .125   .875
  no,  hot       .833   .167
  no,  mild      .833   .167
  no,  cool      .250   .750

21 Bayesian Networks. To use the network, simply step through each node and multiply together the table entries that match the instance's attribute values (or, more likely, sum the logarithms, as in the multinomial case). Then, as before, normalise the class scores to sum to 1. This works because the links between the nodes determine the probability distribution at each node. Using the network is straightforward; all that remains is to find the best network structure to use. Given a large number of attributes, there's a LARGE number of possible networks...

22 Training Bayesian Networks. We need two components: 1. A way to evaluate a network against the data. As always, we need a measure of 'goodness' that avoids overfitting (which here means too many edges), so we need a penalty for the complexity of the network. 2. A way to search through the space of possible networks. Since we already know the nodes, we need to find where the edges in the graph are: which nodes connect to which other nodes?

23 Training Bayesian Networks. Following the Minimum Description Length ideal, networks with lots of edges will be more complex, and hence likely to over-fit, so we add a penalty for each cell in the nodes' tables. AIC: -LL + K. MDL: -LL + (K/2) * log(N). LL is the total log-likelihood of the network on the training set, i.e. the sum of the logs of the probabilities for each instance in the data set. K is the number of table cells, minus the redundant last cell of each row (which can be calculated as 1 minus the sum of the other cells in that row). N is the number of instances in the data.
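The two scoring functions are simple to state in code. This is a sketch of the formulas on the slide; the numbers in the example are hypothetical, and the natural logarithm is assumed for the MDL term (the slide does not fix the base):

```python
import math

def aic(log_likelihood, k):
    """AIC score: -LL + K. Lower is better in this formulation."""
    return -log_likelihood + k

def mdl(log_likelihood, k, n):
    """MDL score: -LL + (K/2) * log(N). Natural log assumed."""
    return -log_likelihood + (k / 2) * math.log(n)

# Hypothetical network: LL = -100, K = 10 free table cells,
# N = 14 training instances (the weather dataset's size).
print(aic(-100, 10))             # 110
print(round(mdl(-100, 10, 14), 1))   # 113.2
```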

24 Network Training: K2. K2: visit the nodes in some order; for each node, consider each previous node in the order as a candidate parent, add the one that most improves the network's worth, and continue until no addition improves it (use MDL or AIC to measure worth). The results of K2 depend on the initial order chosen for the nodes, so run it several times with different orders and select the best. It can help to make the class attribute first and link it to all nodes (though that's not a requirement).
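The K2 loop described above can be sketched as follows. This is a minimal illustration, not the full algorithm: the `score` callback stands in for an MDL- or AIC-based local score, and `toy_score` is an invented function that simply rewards having 'play' as a parent:

```python
def k2_search(order, score):
    """Greedy K2 structure search: visit nodes in the given order; for
    each node, repeatedly add the earlier node that most improves the
    local score, stopping when no single addition helps."""
    parents = {node: set() for node in order}
    for i, node in enumerate(order):
        while True:
            best, best_score = None, score(node, parents[node])
            for cand in order[:i]:            # only earlier nodes allowed
                if cand in parents[node]:
                    continue
                s = score(node, parents[node] | {cand})
                if s > best_score:
                    best, best_score = cand, s
            if best is None:                  # no improvement: stop
                break
            parents[node].add(best)
    return parents

# Toy score: reward having 'play' as a parent, penalise extra edges.
def toy_score(node, parents):
    return (1.0 if 'play' in parents else 0.0) - 0.4 * len(parents)

print(k2_search(['play', 'outlook', 'windy'], toy_score))
```

Running K2 with different node orders and keeping the best-scoring result, as the slide suggests, just means calling `k2_search` once per ordering.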

25 Other Structures. TAN (Tree Augmented Naïve Bayes): in Naïve Bayes, the class attribute is the only parent of each node; start from that structure and consider adding a second parent to each node. Bayesian Multinet: build a separate network for each class and combine the values.

26 Further Reading. Witten, sections 4.2 and 6.7; Han, section 6.4; Dunham, section 4.2; Devijver and Kittler, Pattern Recognition: A Statistical Approach, chapter 2; Berry and Browne, chapter 2.

