
Python & Web Mining, Lecture 5
Old Dominion University, Department of Computer Science
CS 495, Fall 2012
Hany SalahEldeen Khalil
Presented & Prepared by: Justin F. Brunelle

Chapter 6: "Document Filtering"

Document Filtering
In a nutshell: classifying documents based on their content. This classification can be binary (good/bad, spam/not-spam) or n-ary (school-related e-mails, work-related e-mails, commercials, etc.).

Why do we need Document Filtering?
Eliminating spam. Removing unrelated comments in forums and public message boards. Classifying social/work-related e-mails automatically. Forwarding information-request e-mails to the expert who is most capable of answering them.

Spam Filtering
At first, spam filters were rule-based classifiers that looked for things like: overuse of capital letters, words related to pharmaceutical products, garish HTML colors.

Cons of using Rule-based Classifiers
Easy to trick by simply avoiding the patterns of capital letters, etc. What is considered spam varies from one person to another. Ex: the inbox of a medical rep vs. that of a house-wife.

Solution
Develop programs that learn. Teach them the differences and how to recognize each class by providing examples of each class.

Features
We need to extract features from documents in order to classify them. A feature is anything that you can determine as being either present or absent in the item.

Definitions
item = document
feature = word
classification = {good|bad}

Dictionary Building

Dictionary Building
Remember: lower-casing words reduces the total number of features by removing the SHOUTING style. The size of the features is also crucial (e.g., using the entire e-mail as one feature vs. each letter as a feature).
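The extraction code from this slide is not reproduced in the transcript. A minimal sketch of a word-based feature extractor, in the spirit of the book's getwords() (the exact implementation on the slide may differ), could be:

import re

def getwords(doc):
    # split on non-alphanumeric characters and lower-case everything,
    # which collapses the SHOUTING style into ordinary features
    splitter = re.compile(r'\W+')
    words = [s.lower() for s in splitter.split(doc) if 2 < len(s) < 20]
    # a feature is simply present (1) or absent in the item
    return dict((w, 1) for w in words)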

Classifier Training
The classifier is designed to start off very uncertain and to increase its certainty as it learns features.

Classifier Training
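The training code on this slide appears only as an image. A hedged sketch of how such a classifier might keep its counts is shown below; the method names (incf, incc) are illustrative and may differ from the actual docclass.py used in class, while sampletrain() reuses the five training sentences that appear in the sessions later on.

class classifier:
    def __init__(self, getfeatures):
        self.fc = {}              # feature -> {category: count}
        self.cc = {}              # category -> number of documents seen
        self.getfeatures = getfeatures

    def incf(self, f, cat):       # increment a feature/category count
        self.fc.setdefault(f, {}).setdefault(cat, 0)
        self.fc[f][cat] += 1

    def incc(self, cat):          # increment a category (document) count
        self.cc[cat] = self.cc.get(cat, 0) + 1

    def fcount(self, f, cat):     # how often a feature appeared in a category
        return float(self.fc.get(f, {}).get(cat, 0))

    def catcount(self, cat):      # how many documents are in a category
        return float(self.cc.get(cat, 0))

    def train(self, item, cat):
        # count every feature of this item under the given category,
        # then bump the document count for that category
        for f in self.getfeatures(item):
            self.incf(f, cat)
        self.incc(cat)

def sampletrain(cl):
    cl.train('Nobody owns the water.', 'good')
    cl.train('the quick rabbit jumps fences', 'good')
    cl.train('buy pharmaceuticals now', 'bad')
    cl.train('make quick money at the online casino', 'bad')
    cl.train('the quick brown fox jumps', 'good')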

Probabilities
A probability is a number between 0 and 1 indicating how likely an event is.

Probabilities
'quick' appeared in 2 documents classified as good, and the total number of good documents is 3.

Conditional Probabilities
Pr(A|B) = "probability of A given B"
fprob(quick|good) = "probability of quick given good" = (quick classified as good) / (total good items) = 2 / 3
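As code, this conditional probability is a one-liner. A sketch of fprob() as it might appear on the classifier class outlined after the training slide (the slide's own code is not in the transcript):

# method on the classifier class sketched after the training slide
def fprob(self, f, cat):
    # Pr(feature | category): fraction of this category's documents containing the feature,
    # e.g. fprob('quick', 'good') = 2 / 3 with the sample training data
    if self.catcount(cat) == 0:
        return 0.0
    return self.fcount(f, cat) / self.catcount(cat)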

Starting with a Reasonable Guess
Using only the information we have seen so far makes the classifier extremely sensitive in the early training stages. Ex: "money". Money appeared in the casino training document as bad, so it appears with probability 0 for good, which is not right!

Solution: Start with an Assumed Probability
Start, for instance, with an assumed probability of 0.5 for each feature. Also decide the weight given to the assumed probability.

Assumed Probability
>>> cl.fprob('money','bad')
0.5
>>> cl.fprob('money','good')
0.0
We have data for bad, but should we really start with 0 probability for money given good?
>>> cl.weightedprob('money','good',cl.fprob)
0.25
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.weightedprob('money','good',cl.fprob)
0.16666666666666666
>>> cl.fcount('money','bad')
3.0
>>> cl.weightedprob('money','bad',cl.fprob)
0.5
Define an assumed probability of 0.5; weightedprob() then returns the weighted mean of fprob and the assumed probability:
weightedprob(money,good) = (weight * assumed + count * fprob()) / (count + weight)
= (1*0.5 + 1*0) / (1+1) = 0.5 / 2 = 0.25
(double the training)
= (1*0.5 + 2*0) / (2+1) = 0.5 / 3 = 0.1667
Pr(money|bad) remains = (1*0.5 + 3*0.5) / (3+1) = 2 / 4 = 0.5
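A sketch of weightedprob() implementing exactly the weighted-mean formula above (weight = 1 and assumed probability = 0.5 by default), again as a method on the classifier sketched earlier:

# method on the classifier class sketched after the training slide
def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
    basicprob = prf(f, cat)                            # e.g. fprob('money','good') = 0
    totals = sum(self.fcount(f, c) for c in self.cc)   # times the feature was seen in any category
    # weighted mean of the assumed probability and the observed probability
    return (weight * ap + totals * basicprob) / (weight + totals)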

Naïve Bayesian Classifier
Move from terms to documents: Pr(document) = Pr(term1) * Pr(term2) * ... * Pr(termn). Naïve because we assume all terms occur independently; we know this is a simplifying assumption, and it is naïve to think all terms have equal probability for completing this phrase: "Shave and a hair cut ___ ____". Bayesian because we use Bayes' Theorem to invert the conditional probabilities.
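A sketch of that product of term probabilities as a naivebayes subclass (matching the docclass.naivebayes name used in the sessions below):

class naivebayes(classifier):
    def docprob(self, item, cat):
        # Pr(document | category) under the naive independence assumption:
        # multiply Pr(term | category) over every term in the document
        p = 1.0
        for f in self.getfeatures(item):
            p *= self.weightedprob(f, cat, self.fprob)
        return p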

Bayes' Theorem
Given our training data, we know Pr(feature|classification). What we really want to know is Pr(classification|feature). Bayes' Theorem: Pr(A|B) = Pr(B|A) Pr(A) / Pr(B). Applied here: Pr(good|doc) = Pr(doc|good) Pr(good) / Pr(doc), where Pr(doc|good) is what we know how to calculate, Pr(good) = #good / #total, and Pr(doc) is skipped since it is the same for each classification.
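Applying Bayes' Theorem (and dropping the common Pr(doc) denominator) then looks roughly like this, continuing the naivebayes sketch:

# method on the naivebayes class sketched above
def prob(self, item, cat):
    # Pr(category) = #docs in category / #docs total
    catprob = self.catcount(cat) / sum(self.catcount(c) for c in self.cc)
    # unnormalized Pr(category | document); Pr(document) is omitted because
    # it is the same for every category we compare
    return self.docprob(item, cat) * catprob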

Our Bayesian Classifier
>>> import docclass
>>> cl=docclass.naivebayes(docclass.getwords)
>>> docclass.sampletrain(cl)
Nobody owns the water.
the quick rabbit jumps fences
buy pharmaceuticals now
make quick money at the online casino
the quick brown fox jumps
>>> cl.prob('quick rabbit','good')
quick rabbit
>>> cl.prob('quick rabbit','bad')
quick rabbit
>>> cl.prob('quick rabbit jumps','good')
quick rabbit jumps
>>> cl.prob('quick rabbit jumps','bad')
quick rabbit jumps
We use these values only for comparison, not as "real" probabilities.

Bayesian Classifier
_classifier#Testing

Classification Thresholds
>>> cl.prob('quick rabbit','good')
quick rabbit
>>> cl.prob('quick rabbit','bad')
quick rabbit
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
>>> cl.prob('quick money','good')
quick money
>>> cl.prob('quick money','bad')
quick money
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.setthreshold('bad',3.0)
>>> cl.classify('quick money',default='unknown')
quick money
'unknown'
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'
Only classify something as bad if it is 3x more likely to be bad than good.
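A sketch of how such a threshold might be enforced in classify() (assuming a self.thresholds dict, filled by setthreshold() as in the session above):

# methods on the naivebayes class sketched above; assumes self.thresholds = {}
def setthreshold(self, cat, t):
    self.thresholds[cat] = t

def classify(self, item, default=None):
    # score every category, then pick the best one
    probs = {cat: self.prob(item, cat) for cat in self.cc}
    if not probs:
        return default
    best = max(probs, key=probs.get)
    # accept 'best' only if it beats every other category by its threshold factor,
    # e.g. setthreshold('bad', 3.0) demands bad be 3x more likely than good
    for cat, p in probs.items():
        if cat != best and p * self.thresholds.get(best, 1.0) > probs[best]:
            return default
    return best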

Classification Thresholds (cont.)
>>> for i in range(10): docclass.sampletrain(cl)
>>> cl.prob('quick money','good')
quick money
>>> cl.prob('quick money','bad')
quick money
>>> cl.classify('quick money',default='unknown')
quick money
u'bad'
>>> cl.prob('quick rabbit','good')
quick rabbit
>>> cl.prob('quick rabbit','bad')
quick rabbit
>>> cl.classify('quick rabbit',default='unknown')
quick rabbit
u'good'

Fisher Method
Normalize the frequencies for each category. E.g., we might have far more "bad" training data than good, so the net cast by the bad data will be "wider" than we'd like. Calculate the normalized Bayesian probability, then fit the result to an inverse chi-square function to see what the probability is that a random document of that classification would have those features (i.e., terms).
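A sketch of those two steps as a fisherclassifier (class and method names match the sessions below; the inverse chi-square helper follows the usual series expansion, and the exact code in the class files may differ):

import math

def invchi2(chi, df):
    # inverse chi-square: probability that a chi-square value this extreme
    # would occur by chance with df degrees of freedom
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

class fisherclassifier(classifier):
    def cprob(self, f, cat):
        # frequency of the feature in this category, normalized by its
        # frequency across all categories (the per-category normalization above)
        clf = self.fprob(f, cat)
        if clf == 0:
            return 0.0
        return clf / sum(self.fprob(f, c) for c in self.cc)

    def fisherprob(self, item, cat):
        # multiply the weighted, normalized probabilities together, then feed
        # -2*ln(product) to the inverse chi-square function
        p = 1.0
        features = self.getfeatures(item)
        for f in features:
            p *= self.weightedprob(f, cat, self.cprob)
        return invchi2(-2 * math.log(p), len(features) * 2)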

Fisher Example
>>> import docclass
>>> cl=docclass.fisherclassifier(docclass.getwords)
>>> cl.setdb('mln.db')
>>> docclass.sampletrain(cl)
>>> cl.cprob('quick','good')
>>> cl.fisherprob('quick','good')
quick
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
>>> cl.cprob('rabbit','good')
1.0
>>> cl.fisherprob('rabbit','good')
rabbit
0.75
>>> cl.cprob('quick','good')
>>> cl.cprob('quick','bad')

Fisher Example
>>> cl.cprob('money','good')
0
>>> cl.cprob('money','bad')
1.0
>>> cl.cprob('buy','bad')
1.0
>>> cl.cprob('buy','good')
0
>>> cl.fisherprob('money buy','good')
money buy
>>> cl.fisherprob('money buy','bad')
money buy
>>> cl.fisherprob('money quick','good')
money quick
>>> cl.fisherprob('money quick','bad')
money quick

Classification with Inverse Chi-Square
>>> cl.fisherprob('quick rabbit','good')
quick rabbit
>>> cl.classify('quick rabbit')
quick rabbit
u'good'
>>> cl.fisherprob('quick money','good')
quick money
>>> cl.classify('quick money')
quick money
u'bad'
>>> cl.setminimum('bad',0.8)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.4)
>>> cl.classify('quick money')
quick money
u'good'
>>> cl.setminimum('good',0.42)
>>> cl.classify('quick money')
quick money
This version of the classifier does not print "unknown" as a classification. In practice, we'll tolerate false positives for "good" more than false negatives for "good": we'd rather see a message that is spam than lose a message that is not spam.
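A sketch of the corresponding classify() with per-category minimums (assuming a self.minimums dict filled by setminimum(), as used above); it simply returns the default when no category clears its minimum:

# methods on the fisherclassifier sketched earlier; assumes self.minimums = {}
def setminimum(self, cat, minimum):
    self.minimums[cat] = minimum

def classify(self, item, default=None):
    best, maxp = default, 0.0
    for cat in self.cc:
        p = self.fisherprob(item, cat)
        # a category is eligible only if its Fisher probability clears its minimum
        if p > self.minimums.get(cat, 0.0) and p > maxp:
            best, maxp = cat, p
    return best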

Fisher -- Simplified
Reduces the signal-to-noise ratio. Assumes documents occur with a normal distribution. Estimates differences in corpus size with chi-squared. Chi-squared is a "goodness-of-fit" test between an observed distribution and a theoretical distribution. Utilizes confidence interval & standard deviation estimations for a corpus.

Assignment 4
Pick one question from the end of the chapter. Implement the function and state briefly the differences. Utilize the Python files associated with the class if needed. Deadline: next week.