Python & Web Mining Old Dominion University Department of Computer Science Hany SalahEldeen CS495 – Python & Web Mining Fall 2012 Lecture 5 CS 495 Fall 2012 Hany SalahEldeen Khalil Presented & Prepared by: Justin F. Brunelle

Chapter 6: “Document Filtering”

Document Filtering
In a nutshell: classifying documents based on their content.
The classification can be binary (good/bad, spam/not-spam) or n-ary (school-related emails, work-related emails, commercials, etc.).

Why do we need document filtering?
Eliminating spam.
Removing unrelated comments in forums and public message boards.
Classifying social and work-related emails automatically.
Forwarding information-request emails to the expert who is most capable of answering them.

Spam Filtering
At first, spam filters were rule-based classifiers that looked for:
Overuse of capital letters
Words related to pharmaceutical products
Garish HTML colors

Cons of Using Rule-Based Classifiers
Easy to trick: spammers just avoid the known patterns (capital letters, etc.).
What is considered spam varies from one person to another. Ex: the inbox of a medical rep vs. the inbox of a housewife.

Solution
Develop programs that learn.
Teach them the differences between the classes and how to recognize each one by providing examples of each class.

Features
We need to extract features from documents in order to classify them.
Feature: anything that you can determine as being either present or absent in the item.

Definitions
item = document
feature = word
classification = {good|bad}

Dictionary Building
Remember: converting words to lowercase reduces the total number of features by removing the SHOUTING style.
The size of the features is also crucial (using the entire email as a feature vs. using each letter as a feature).
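
A minimal sketch of the feature extraction described on this slide. The function name `getwords` is the one used in the later interpreter sessions; the regular expression, the 3-to-19 character length limits, and the choice to return a set are assumptions made for illustration.

```python
import re

def getwords(doc):
    """Return the distinct lowercase words in doc (assumed limits: 3 to 19 characters)."""
    # Split on runs of non-word characters; lowercasing folds SHOUTING into normal words.
    words = re.split(r"\W+", doc)
    return {w.lower() for w in words if 2 < len(w) < 20}

print(sorted(getwords("Make QUICK money at the online casino")))
```

Each word is either present or absent in the document, so repeated words count only once.
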

Classifier Training
The classifier is designed to start off very uncertain.
Its certainty increases as it learns which features go with which classification.
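
Training can be sketched as simple counting: how many documents of each classification contain each feature. This is an illustrative sketch, not the course's `docclass` module; the class name `Classifier`, its attributes `fc` and `cc`, and the regex-based `getwords` are assumptions. The five training sentences are the ones printed by `sampletrain` in the later sessions, with labels inferred from the counts shown on the slides.

```python
import re
from collections import Counter

def getwords(doc):
    # Hypothetical feature extractor: distinct lowercase words of 3 to 19 letters.
    return {w for w in re.findall(r"[a-z]+", doc.lower()) if 2 < len(w) < 20}

class Classifier:
    def __init__(self, getfeatures):
        self.getfeatures = getfeatures
        self.fc = Counter()  # (feature, category) -> number of documents
        self.cc = Counter()  # category -> number of documents

    def train(self, item, cat):
        # Count every feature of the item once for the given category.
        for f in self.getfeatures(item):
            self.fc[f, cat] += 1
        self.cc[cat] += 1

    def fcount(self, f, cat):
        return float(self.fc[f, cat])

    def catcount(self, cat):
        return float(self.cc[cat])

def sampletrain(cl):
    cl.train("Nobody owns the water.", "good")
    cl.train("the quick rabbit jumps fences", "good")
    cl.train("buy pharmaceuticals now", "bad")
    cl.train("make quick money at the online casino", "bad")
    cl.train("the quick brown fox jumps", "good")

cl = Classifier(getwords)
sampletrain(cl)
print(cl.fcount("quick", "good"), cl.fcount("quick", "bad"))  # 2.0 1.0
```

Before any call to `train`, every count is zero, which is the "very uncertain" starting point.
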

Probabilities
A probability is a number between 0 and 1 indicating how likely an event is.

Probabilities
Example: ‘quick’ appeared in 2 documents classified as good, and the total number of good documents is 3.

Conditional Probabilities
Pr(A|B) = “probability of A given B”
fprob(quick|good) = “probability of quick given good”
= (quick classified as good) / (total good items)
= 2 / 3
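
The 2 / 3 figure can be reproduced with a small sketch. The function below is a standalone stand-in for `fprob`, not the course implementation; the `SAMPLES` list and its good/bad labels are assumptions based on the five training sentences that appear in the later sessions.

```python
# Assumed training set: (document, classification)
SAMPLES = [
    ("Nobody owns the water.", "good"),
    ("the quick rabbit jumps fences", "good"),
    ("buy pharmaceuticals now", "bad"),
    ("make quick money at the online casino", "bad"),
    ("the quick brown fox jumps", "good"),
]

def fprob(feature, cat, samples=SAMPLES):
    """Pr(feature | cat): fraction of documents in cat that contain the feature."""
    docs_in_cat = [set(text.lower().split()) for text, c in samples if c == cat]
    if not docs_in_cat:
        return 0.0
    return sum(feature in words for words in docs_in_cat) / len(docs_in_cat)

print(fprob("quick", "good"))  # 2 of the 3 good documents contain 'quick'
print(fprob("money", "bad"))   # 1 of the 2 bad documents contains 'money'
```
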

Starting with a Reasonable Guess
Using only the information seen so far makes the classifier extremely sensitive in the early training stages.
Ex: “money” appeared only in the casino training document, which is classified as bad.
It therefore gets probability 0 for good, which is not right!

Solution: Start with an Assumed Probability
Start, for instance, with a probability of 0.5 for each feature.
Also decide how much weight to give the assumed probability.

Assumed Probability
>>> cl.fprob('money','bad')
0.5
>>> cl.fprob('money','good')
0.0
We have data for bad, but should we start with 0 probability for money given good?
>>> cl.weightedprob('money','good',cl.fprob)
0.25
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.weightedprob('money','good',cl.fprob)
>>> cl.fcount('money','bad')
3.0
>>> cl.weightedprob('money','bad',cl.fprob)
0.5
Define an assumed probability of 0.5; weightedprob() then returns the weighted mean of fprob and the assumed probability:
weightedprob(money,good) = (weight * assumed + count * fprob()) / (count + weight)
= (1*0.5 + 1*0) / (1+1) = 0.5 / 2 = 0.25
(double the training) = (1*0.5 + 2*0) / (2+1) = 0.5 / 3 ≈ 0.167
Pr(money|bad) remains = (1*0.5 + 3*0.5) / (3+1) = 0.5
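
The weighted mean on this slide can be written as a small standalone function. This is a sketch of the formula only; the parameter names `basicprob` and `total_count` are chosen here for illustration and are not taken from `docclass`.

```python
def weightedprob(basicprob, total_count, weight=1.0, assumed=0.5):
    """Weighted mean of the observed probability and the assumed probability.

    basicprob   -- observed fprob for the feature in the category
    total_count -- how many times the feature has been seen in all categories
    """
    return (weight * assumed + total_count * basicprob) / (weight + total_count)

print(weightedprob(0.0, 1))  # 'money' given good after one sighting
print(weightedprob(0.0, 2))  # after doubling the training
print(weightedprob(0.5, 3))  # 'money' given bad stays at the observed 0.5
```

With no sightings at all (`total_count = 0`) the function returns the assumed probability, and the observed value takes over as the count grows.
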

Naïve Bayesian Classifier
Move from terms to documents:
Pr(document) = Pr(term 1) * Pr(term 2) * … * Pr(term n)
Naïve because we assume all terms occur independently. We know this is a simplifying assumption; it is naïve to think all terms have equal probability for completing this phrase: “Shave and a hair cut ___ ____”
Bayesian because we use Bayes’ Theorem to invert the conditional probabilities.

Bayes’ Theorem
Given our training data, we know: Pr(feature|classification)
What we really want to know is: Pr(classification|feature)
Bayes’ Theorem: Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)
Pr(good|doc) = Pr(doc|good) Pr(good) / Pr(doc)
Pr(doc|good): we know how to calculate this
Pr(good): #good / #total
Pr(doc): we skip this since it is the same for each classification
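
Putting the pieces together, Pr(doc|good) * Pr(good) can be sketched as follows. This is an illustrative reimplementation, not the course's `docclass.naivebayes`; the class name `NaiveBayes`, the `Counter`-based storage, the regex-based `getwords`, and the labels in `SAMPLES` are assumptions, as are the weight of 1 and the assumed probability of 0.5.

```python
import re
from collections import Counter

SAMPLES = [
    ("Nobody owns the water.", "good"),
    ("the quick rabbit jumps fences", "good"),
    ("buy pharmaceuticals now", "bad"),
    ("make quick money at the online casino", "bad"),
    ("the quick brown fox jumps", "good"),
]

def getwords(doc):
    # Hypothetical feature extractor: distinct lowercase words of 3 to 19 letters.
    return {w for w in re.findall(r"[a-z]+", doc.lower()) if 2 < len(w) < 20}

class NaiveBayes:
    def __init__(self):
        self.fc = Counter()  # (feature, category) -> document count
        self.cc = Counter()  # category -> document count

    def train(self, doc, cat):
        for f in getwords(doc):
            self.fc[f, cat] += 1
        self.cc[cat] += 1

    def fprob(self, f, cat):
        return self.fc[f, cat] / self.cc[cat] if self.cc[cat] else 0.0

    def weightedprob(self, f, cat, weight=1.0, assumed=0.5):
        total = sum(self.fc[f, c] for c in self.cc)
        return (weight * assumed + total * self.fprob(f, cat)) / (weight + total)

    def docprob(self, doc, cat):
        # Naive step: multiply the per-term probabilities as if independent.
        p = 1.0
        for f in getwords(doc):
            p *= self.weightedprob(f, cat)
        return p

    def prob(self, doc, cat):
        # Pr(doc|cat) * Pr(cat); Pr(doc) is skipped, so this is only a score.
        return self.docprob(doc, cat) * self.cc[cat] / sum(self.cc.values())

cl = NaiveBayes()
for text, cat in SAMPLES:
    cl.train(text, cat)
print(cl.prob("quick rabbit", "good"), cl.prob("quick rabbit", "bad"))
```

Because Pr(doc) is dropped, the two scores do not sum to 1; they are only meaningful relative to each other.
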

Our Bayesian Classifier
>>> import docclass
>>> cl=docclass.naivebayes(docclass.getwords)
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.prob('quick rabbit','good')
quick rabbit
>>> cl.prob('quick rabbit','bad')
quick rabbit
>>> cl.prob('quick rabbit jumps','good')
quick rabbit jumps
>>> cl.prob('quick rabbit jumps','bad')
quick rabbit jumps
We use these values only for comparison, not as “real” probabilities.

Classification Thresholds
>>> cl.prob('quick rabbit','good')
quick rabbit
>>> cl.prob('quick rabbit','bad')
quick rabbit
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
>>> cl.prob('quick money','good')
quick money
>>> cl.prob('quick money','bad')
quick money
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.setthreshold('bad',3.0)
>>> cl.classify('quick money',default='unknown')
quick money
'unknown'
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Only classify something as bad if it is 3X more likely to be bad than good.

Classification Thresholds (cont.)
>>> for i in range(10): docclass.sampletrain(cl)
>>> cl.prob('quick money','good')
quick money
>>> cl.prob('quick money','bad')
quick money
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.prob('quick rabbit','good')
quick rabbit
>>> cl.prob('quick rabbit','bad')
quick rabbit
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
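
The threshold rule can be sketched as a standalone function over per-category scores. This is an illustration of the "3X more likely" rule, not the `docclass` code; the function signature is an assumption, and the scores passed in below are illustrative values for 'quick money' and 'quick rabbit', not output copied from the sessions above.

```python
def classify(probs, thresholds=None, default=None):
    """Pick the highest-scoring category unless a rival is too close.

    probs      -- dict mapping category -> score
    thresholds -- dict mapping category -> factor by which it must beat every rival
    """
    thresholds = thresholds or {}
    best = max(probs, key=probs.get)
    for cat, p in probs.items():
        if cat == best:
            continue
        # The winner must exceed each rival's score times the winner's threshold.
        if p * thresholds.get(best, 1.0) > probs[best]:
            return default
    return best

quick_money = {"good": 0.09375, "bad": 0.1}   # illustrative scores
quick_rabbit = {"good": 0.15625, "bad": 0.05}  # illustrative scores

print(classify(quick_money, default="unknown"))
print(classify(quick_money, thresholds={"bad": 3.0}, default="unknown"))
print(classify(quick_rabbit, thresholds={"bad": 3.0}, default="unknown"))
```

Raising the threshold for bad makes the filter more cautious about discarding a message, while leaving the good decisions unchanged.
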

Fisher Method
Normalize the frequencies for each category. E.g., we might have far more “bad” training data than good, so the net cast by the bad data will be “wider” than we’d like.
Calculate the normalized Bayesian probability, then fit the result to an inverse chi-square function to see what the probability is that a random document of that classification would have those features (i.e., terms).
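
The two steps above can be sketched as follows. This is an illustrative standalone version, not the course's `docclass.fisherclassifier`; the module-level counters `fc` and `cc`, the regex-based `getwords`, the labels in `SAMPLES`, and the weight of 1 with assumed probability 0.5 are all assumptions.

```python
import math
import re
from collections import Counter

SAMPLES = [
    ("Nobody owns the water.", "good"),
    ("the quick rabbit jumps fences", "good"),
    ("buy pharmaceuticals now", "bad"),
    ("make quick money at the online casino", "bad"),
    ("the quick brown fox jumps", "good"),
]

def getwords(doc):
    # Hypothetical feature extractor: distinct lowercase words of 3 to 19 letters.
    return {w for w in re.findall(r"[a-z]+", doc.lower()) if 2 < len(w) < 20}

fc, cc = Counter(), Counter()  # (feature, category) counts and category counts
for text, cat in SAMPLES:
    for w in getwords(text):
        fc[w, cat] += 1
    cc[cat] += 1

def fprob(f, cat):
    return fc[f, cat] / cc[cat] if cc[cat] else 0.0

def cprob(f, cat):
    # Normalized probability: Pr(f|cat) divided by the sum of Pr(f|c) over all categories.
    clf = fprob(f, cat)
    if clf == 0:
        return 0.0
    return clf / sum(fprob(f, c) for c in cc)

def weightedprob(f, cat, prf, weight=1.0, assumed=0.5):
    total = sum(fc[f, c] for c in cc)
    return (weight * assumed + total * prf(f, cat)) / (weight + total)

def invchi2(chi, df):
    # Inverse chi-square function for an even number of degrees of freedom.
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisherprob(doc, cat):
    features = getwords(doc)
    p = 1.0
    for f in features:
        p *= weightedprob(f, cat, cprob)
    return invchi2(-2 * math.log(p), 2 * len(features))

print(cprob("rabbit", "good"), fisherprob("rabbit", "good"))
```

Dividing by the sum over categories is what compensates for one category having more training data than the other.
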

Fisher Example
>>> import docclass
>>> cl=docclass.fisherclassifier(docclass.getwords)
>>> cl.setdb('mln.db')
>>> docclass.sampletrain(cl)
>>> cl.cprob('quick','good')
>>> cl.fisherprob('quick','good')
quick
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
>>> cl.cprob('rabbit','good')
1.0
>>> cl.fisherprob('rabbit','good')
rabbit
0.75
>>> cl.cprob('quick','good')
>>> cl.cprob('quick','bad')

Fisher Example (cont.)
>>> cl.cprob('money','good')
0
>>> cl.cprob('money','bad')
1.0
>>> cl.cprob('buy','bad')
1.0
>>> cl.cprob('buy','good')
0
>>> cl.fisherprob('money buy','good')
money buy
>>> cl.fisherprob('money buy','bad')
money buy
>>> cl.fisherprob('money quick','good')
money quick
>>> cl.fisherprob('money quick','bad')
money quick

Classification with Inverse Chi-Square
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
>>> cl.classify('quick rabbit')
quick rabbit
u'good'
>>> cl.fisherprob('quick money','good')
quick money
>>> cl.classify('quick money')
quick money
u'bad'
>>> cl.setminimum('bad',0.8)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.4)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.42)
>>> cl.classify('quick money')
quick money
This version of the classifier does not print “unknown” as a classification.
In practice, we’ll tolerate false positives for “good” more than false negatives for “good”: we’d rather see a message that is spam than lose a message that is not spam.
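
The effect of `setminimum` can be sketched as a lower bound that each category's Fisher score must clear. The function below is an illustration only; its signature is an assumption, and the scores 0.41 and 0.70 are illustrative stand-ins for the 'quick money' scores, not values copied from the session above.

```python
def classify(fisher_probs, minimums=None, default=None):
    """Return the category with the highest score that clears its minimum."""
    minimums = minimums or {}
    best, best_p = default, 0.0
    for cat, p in fisher_probs.items():
        # A category is a candidate only if its score exceeds its own lower bound.
        if p > minimums.get(cat, 0.0) and p > best_p:
            best, best_p = cat, p
    return best

quick_money = {"good": 0.41, "bad": 0.70}  # illustrative Fisher scores

print(classify(quick_money))
print(classify(quick_money, minimums={"bad": 0.8}))
print(classify(quick_money, minimums={"bad": 0.8, "good": 0.4}))
print(classify(quick_money, minimums={"bad": 0.8, "good": 0.42}))
```

When no category clears its minimum the function returns the default (`None` here), which corresponds to the empty result in the last call of the session.
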

Fisher -- Simplified
Reduces the signal-to-noise ratios
Assumes documents occur with a normal distribution
Estimates differences in corpus size with χ-squared
“Chi”-squared is a “goodness-of-fit” measure between an observed distribution and a theoretical distribution
Utilizes confidence interval and standard deviation estimations for a corpus

Assignment 4
Pick one question from the end of the chapter.
Implement the function and briefly state the differences.
Use the Python files associated with the class if needed.
Deadline: next week