Text Categorization Hongning Wang

Today's lecture
Bayes decision theory
Supervised text categorization
– General steps for text categorization
– Feature selection methods
– Evaluation metrics

Text mining in general
– Access: filter information; serves IR applications
– Mining: discover knowledge; based on NLP/ML techniques; a sub-area of DM research
– Organization: add structure/annotations

Applications of text categorization
Automatically classify political news from sports news

Applications of text categorization
Recognizing spam emails (Spam = True/False)

Applications of text categorization
Sentiment analysis

Basic notions about categorization
– Documents are mapped to a vector space representation
– Categorization is a mapping from documents to class labels
Key question: how to find such a mapping?

Bayes decision theory
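The derivation on this slide is shown as equations in images and is not preserved in the transcript; a standard statement of the Bayes decision rule, consistent with the discussion that follows, is the sketch below (the notation is assumed, not taken from the slides):

$$\hat{y}(x) = \arg\max_{y \in \mathcal{Y}} p(y \mid x) = \arg\max_{y \in \mathcal{Y}} p(x \mid y)\, p(y)$$

Under 0/1 loss, deciding by the largest posterior probability minimizes the probability of error, which is the Bayes risk discussed next.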

Bayes risk
[Figure: two class-conditional distributions with the false positive and false negative regions induced by a decision boundary]

Bayes risk
Risk: assigning an instance to the wrong class
[Figure: false positive and false negative regions around the Bayes decision boundary]

Bayes risk
Risk: assigning an instance to the wrong class
[Figure: moving the decision boundary away from the optimum increases the error (larger false positive/false negative regions)]

Bayes risk
Risk: assigning an instance to the wrong class
[Figure: the optimal Bayes decision boundary, which minimizes the false positive and false negative regions]

Bayes risk

Loss function
The penalty we will pay when misclassifying instances
Goal of classification in general: minimize the loss
(If the loss is no longer symmetric, will the same decision boundary still be optimal?)
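The formulas on this slide are images; a generic way to write the objective (a sketch, with a loss function L that need not be symmetric) is:

$$f^{*} = \arg\min_{f}\; \mathbb{E}_{(x,y)}\big[\, L(f(x), y) \,\big]$$

With 0/1 loss this reduces to the Bayes decision rule above; with an asymmetric loss (e.g., a false negative costing more than a false positive), the optimal decision boundary shifts accordingly.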

Supervised text categorization
Supervised learning
– Estimate a model/method from labeled data
– The model can then be used to determine the labels of unobserved samples
[Figure: labeled training documents (Sports, Business, Education, Science, ...) and unlabeled testing documents]

Types of classification methods
Model-less
– Instance-based classifiers
  - Use the observations directly
  - E.g., kNN
[Figure: training documents (Sports, Business, Education, Science, ...); at testing time, labels are assigned by instance lookup]
Key: assumes similar items have similar class labels!

Types of classification methods
Model-based classifiers
[Figure: a model is estimated from the training documents and then applied to the testing documents]
Key: the i.i.d. assumption!

Generative vs. discriminative models
Binary classification as an example
[Figure: the generative model's view vs. the discriminative model's view of the data]

Generative vs. discriminative models
[Table comparing generative and discriminative models (contents shown as an image)]
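The comparison table itself is not preserved in the transcript; the core distinction, stated as a sketch, is:

$$\text{Generative: model } p(x, y) = p(x \mid y)\,p(y) \qquad \text{Discriminative: model } p(y \mid x) \text{ directly}$$

Naive Bayes is a typical generative classifier, while logistic regression is its discriminative counterpart.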

General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
[Figure: example categories – Political News, Sports News, Entertainment News]

Recap: Bayes risk
Risk: assigning an instance to the wrong class
[Figure: the optimal Bayes decision boundary, which minimizes the false positive and false negative regions]

Recap: Bayes risk
Risk: assigning an instance to the wrong class
[Figure: moving the decision boundary away from the optimum increases the error]

General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
Consider:
1.1 How do we represent the text documents?
1.2 Do we need all of those features?

Feature construction for text categorization
Vector space representation
– Standard procedure in document representation
– Features: N-grams, POS tags, named entities, topics
– Feature values: binary (presence/absence), TF-IDF (many variants)
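A minimal sketch of this feature construction step in Python, using unigram features and one common TF-IDF variant (the function name and the weighting choice are illustrative, not from the lecture):

```python
import math
from collections import Counter

def build_tfidf(docs):
    """docs: list of token lists; returns (vocab, sparse TF-IDF vectors)."""
    df = Counter()                       # document frequency of each term
    for tokens in docs:
        df.update(set(tokens))
    vocab = {t: i for i, t in enumerate(sorted(df))}
    n = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {vocab[t]: (1 + math.log(c)) * math.log(n / df[t])
               for t, c in tf.items()}   # sub-linear TF times IDF
        vectors.append(vec)              # sparse dict: feature index -> weight
    return vocab, vectors

vocab, X = build_tfidf([["game", "score", "team"], ["election", "vote", "team"]])
```

For binary (presence/absence) features, the weight would simply be 1 for every term that occurs in the document.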

Recall MP1
How many unigrams + bigrams are there in our controlled vocabulary?
– 130K on Yelp_small
How many review documents do we have for training?
– 629K on Yelp_small
Very sparse feature representation!

Feature selection for text categorization
Select the most informative features for model training
– Reduce noise in the feature representation
  - Improve final classification performance
– Improve training/testing efficiency
  - Lower time complexity
  - Less training data required

Feature selection methods
Wrapper method
– Find the best subset of features for a particular classification method
[Figure from R. Kohavi and G. H. John, Artificial Intelligence 97 (1997): the feature subset search is evaluated with the same classifier]

Feature selection methods
Wrapper method
– Search the whole space of feature subsets
  - Sequential forward selection or genetic search can be used to speed up the search
[Figure from R. Kohavi and G. H. John, Artificial Intelligence 97 (1997)]

Feature selection methods
Wrapper method
– Considers all possible dependencies among the features
– Impractical for text categorization
  - Cannot deal with a large feature set
  - An NP-complete problem
– No direct relation between feature subset selection and evaluation

Feature selection methods
Filter method
– Evaluate the features independently of the classifier and of the other features
  - No indication of a classifier's performance on the selected features
  - No dependencies among the features are considered
– Feasible for very large feature sets
  - Usually used as a preprocessing step
[Figure from R. Kohavi and G. H. John, Artificial Intelligence 97 (1997)]

Feature scoring metrics
Document frequency
– Rare words are not influential for global prediction, and removing them reduces vocabulary size
[Figure: word frequency distribution – riskier to remove head words, safer to remove rare words]

Feature scoring metrics
Information gain
– Decrease in the entropy of the categorical prediction when the feature is present vs. absent
[Figure: for an informative feature the class uncertainty decreases; for an uninformative one the class uncertainty stays intact]

Feature scoring metrics
Information gain
– Decrease in the entropy of the categorical prediction when the feature is present vs. absent
[Equation shown as an image; the annotated term is the entropy of the class label alone]
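The information gain formula itself appears only as an image in the slides; the standard definition for a binary term feature t, consistent with the description above (a sketch), is:

$$IG(t) = -\sum_{c} p(c)\log p(c) \;+\; p(t)\sum_{c} p(c \mid t)\log p(c \mid t) \;+\; p(\bar{t})\sum_{c} p(c \mid \bar{t})\log p(c \mid \bar{t})$$

The first term is the entropy of the class label alone; the remaining terms correspond to subtracting the conditional entropy of the class given whether the term is present or absent.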

Feature scoring metrics
[Several slides of additional scoring-metric formulas are shown only as images and are not preserved in this transcript]

Feature scoring metrics
The distribution assumption becomes inappropriate in this test

Feature scoring metrics

Recall MP1
How many unigrams + bigrams are there in our controlled vocabulary?
– 130K on Yelp_small
How many review documents do we have for training?
– 629K on Yelp_small
Very sparse feature representation!

A graphical analysis of feature selection
Isoclines for each feature scoring metric
– Machine learning papers vs. other CS papers
[Figure from Forman, G. (2003), "An extensive empirical study of feature selection metrics for text classification," JMLR 3; annotations: zoom in, stopword removal]

A graphical analysis of feature selection
Isoclines for each feature scoring metric
– Machine learning papers vs. other CS papers
[Figure from Forman, G. (2003), "An extensive empirical study of feature selection metrics for text classification," JMLR 3]

Effectiveness of feature selection methods
On a multi-class classification data set
– 229 documents, 19 classes
– Binary features, SVM classifier
[Results figure from Forman, G. (2003), JMLR 3]

Effectiveness of feature selection methods
On a multi-class classification data set
– 229 documents, 19 classes
– Binary features, SVM classifier
[Results figure from Forman, G. (2003), JMLR 3]

Empirical analysis of feature selection methods
Text corpora
– Reuters: 92 classes (document and vocabulary counts not preserved in the transcript)
– OHSUMED: 3,981 documents (class and vocabulary counts not preserved)
Classifiers: kNN and LLSF

General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
Consider:
2.1 What is the unique property of this problem?
2.2 What type of classifier should we use?

Model specification
We will discuss these choices later

General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
Consider:
3.1 How do we estimate the parameters in the selected model?
3.2 How do we control the complexity of the estimated model?

Model estimation and selection
General philosophy
– Loss minimization
  - Empirically estimated from the training set: the empirical loss!
Key assumption: the data are independent and identically distributed (i.i.d.)!
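The loss expressions on this slide are images; a standard form of the empirical loss over a training set of N labeled documents, consistent with the description (a sketch), is:

$$\hat{L}(f) = \frac{1}{N}\sum_{i=1}^{N} L\big(f(x_i),\, y_i\big)$$

Minimizing this quantity on the training data is what the next slides contrast with minimizing the (unobservable) generalization loss.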

Empirical loss minimization
Overfitting
– Good empirical loss, terrible generalization loss
– High model complexity -> prone to fitting the noise
[Figure: the underlying dependency is a linear relation, but an over-complicated (polynomial) dependency assumption fits the noise]

Generalization loss minimization
Avoid overfitting
– Measure model complexity as well
– Model selection and regularization
[Figure: error vs. model complexity – the training-set error keeps decreasing while the testing-set error eventually rises]
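One common way to fold model complexity into the objective (a sketch; the specific regularizer is an assumption, not given in the slides) is to minimize the empirical loss plus a complexity penalty:

$$\hat{f} = \arg\min_{f}\; \frac{1}{N}\sum_{i=1}^{N} L\big(f(x_i), y_i\big) \;+\; \lambda\, \Omega(f)$$

where $\Omega(f)$ measures model complexity (e.g., an L2 norm of the parameters) and $\lambda$ trades off fit against complexity; $\lambda$ is typically chosen by the cross-validation procedure described next.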

Generalization loss minimization
Cross validation
– Avoid noise in the train/test separation
– k-fold cross-validation
  1. Partition all training data into k equal-size disjoint subsets;
  2. Leave one subset out for validation and use the other k-1 for training;
  3. Repeat step (2) k times, so that each of the k subsets is used exactly once as the validation data.
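A minimal Python sketch of this k-fold procedure (the function name and the scoring callback are illustrative):

```python
import random

def k_fold_cv(data, k, train_and_score):
    """data: list of (x, y) examples; train_and_score(train, valid) -> validation score."""
    data = data[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]      # k roughly equal-size disjoint subsets
    scores = []
    for i in range(k):
        valid = folds[i]                        # held-out validation fold
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        scores.append(train_and_score(train, valid))
    return sum(scores) / k                      # average validation performance
```

The model (or setting) with the best average validation score is then selected, as described two slides below.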

Generalization loss minimization
Cross validation
– Avoid noise in the train/test separation
– k-fold cross-validation
[Figure: the k-fold partitioning of the training data]

Generalization loss minimization
Cross validation
– Avoid noise in the train/test separation
– k-fold cross-validation
  - Choose the model (among different models, or the same model with different settings) that has the best average performance on the validation sets
  - A statistical test is needed to decide whether one model is significantly better than another (covered shortly)

General steps for text categorization
1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation
Consider:
4.1 How do we judge the quality of the learned model?
4.2 How can we further improve the performance?

Classification evaluation

Evaluation of binary classification

                      actually positive     actually negative
predicted positive    true positive (TP)    false positive (FP)
predicted negative    false negative (FN)   true negative (TN)

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

Evaluation of binary classification
[Figure: true positive, true negative, false positive, and false negative regions around the optimal Bayes decision boundary]

Recap: generalization loss minimization
Avoid overfitting
– Measure model complexity as well
– Model selection and regularization
[Figure: error vs. model complexity – the training-set error keeps decreasing while the testing-set error eventually rises]

Recap: generalization loss minimization
Cross validation
– Avoid noise in the train/test separation
– k-fold cross-validation
[Figure: the k-fold partitioning of the training data]

Recap: evaluation of binary classification
[Figure: true positive, true negative, false positive, and false negative regions around the optimal Bayes decision boundary]

Precision and recall trade-off
Precision tends to decrease as the number of documents predicted positive increases (except under perfect classification), while recall keeps increasing.
These two metrics emphasize different perspectives on a classifier:
– Precision prefers a classifier that recognizes fewer documents but is highly accurate on them
– Recall prefers a classifier that recognizes more of the positive documents

Summarizing precision and recall
Equal weight between precision and recall
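The formula on this slide is an image; with equal weight on precision and recall, the standard summary is the F1 measure, the harmonic (H) rather than arithmetic (A) mean of the two:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot score well by trading one metric off to an extreme.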

Summarizing precision and recall
With a curve
1. Order all the testing cases by the classifier's prediction score (assuming the higher the score, the more likely the case is positive);
2. Scan through each testing case: treat all cases above it (including itself) as positive and those below it as negative; compute precision and recall;
3. Plot the precision and recall computed for each testing case in step (2).
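A minimal Python sketch of exactly this procedure (names are illustrative):

```python
def precision_recall_curve(scores, labels):
    """scores: prediction scores; labels: 1 for positive, 0 for negative (parallel lists)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    curve = []
    for _, y in ranked:                 # scan through the ranked testing cases
        if y == 1:
            tp += 1                     # cases at or above the cutoff are predicted positive
        else:
            fp += 1
        curve.append((tp / total_pos,   # recall at this cutoff
                      tp / (tp + fp)))  # precision at this cutoff
    return curve                        # list of (recall, precision) points to plot
```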

Summarizing precision and recall
With a curve
– A.k.a. the precision-recall curve
– Area Under the Curve (AUC)
  - At each recall level, we prefer a higher precision

Multi-class categorization
Confusion matrix
– A generalized contingency table for precision and recall
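A minimal sketch of building a confusion matrix and reading per-class precision and recall off it (names are illustrative):

```python
from collections import defaultdict

def confusion_matrix(gold, predicted):
    """gold, predicted: parallel lists of class labels."""
    matrix = defaultdict(int)
    for g, p in zip(gold, predicted):
        matrix[(g, p)] += 1                                  # row: true class, column: predicted class
    return matrix

def per_class_precision_recall(matrix, classes):
    stats = {}
    for c in classes:
        tp = matrix[(c, c)]
        predicted_c = sum(matrix[(g, c)] for g in classes)   # column sum
        actual_c = sum(matrix[(c, p)] for p in classes)      # row sum
        stats[c] = (tp / predicted_c if predicted_c else 0.0,   # precision
                    tp / actual_c if actual_c else 0.0)         # recall
    return stats
```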

Statistical significance tests
How confident are you that an observed difference doesn't simply result from the particular train/test separation you chose?
[Table: per-fold performance of two classifiers and their averages (values not preserved in the transcript)]

Background knowledge
– The p-value of a statistical test is the probability of obtaining data at least as extreme as what was observed, assuming the null hypothesis is true (e.g., that the observation is purely random)
– If the p-value is smaller than the chosen significance level (α), we reject the null hypothesis (i.e., we conclude the observation is not random)
– We seek to reject the null hypothesis (to show that the observation is not a random result), so small p-values are good

Paired t-test
– Tests whether two sets of observations are significantly different from each other
  - In k-fold cross-validation, the different classifiers are applied to the same train/test separations
– Null hypothesis: the difference between the two responses measured on the same statistical unit has zero mean
– One-tailed vs. two-tailed?
  - If you aren't sure, use two-tailed
[Table: paired per-fold results for Algorithm 1 and Algorithm 2]
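A minimal sketch of a paired t-test over per-fold cross-validation scores, using SciPy (the fold scores below are made-up placeholders, not results from the lecture):

```python
from scipy import stats

scores_a = [0.81, 0.79, 0.84, 0.80, 0.82]   # classifier A, one score per fold
scores_b = [0.78, 0.80, 0.79, 0.77, 0.80]   # classifier B, same folds

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)   # paired (two-tailed) t-test
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("The difference is significant at the 0.05 level")
```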

Statistical significance test
[Figure: the region containing 95% of outcomes around 0 under the null hypothesis]
[Table: per-fold performance of Classifier A and Classifier B, their averages, and the resulting paired t-test p-value (values not preserved)]

What you should know
Bayes decision theory
– Bayes risk minimization
General steps for text categorization
– Text feature construction
– Feature selection methods
– Model specification and estimation
– Evaluation metrics

Today's reading
Introduction to Information Retrieval
– Chapter 13: Text classification and Naive Bayes
  - 13.1 The text classification problem
  - 13.5 Feature selection
  - 13.6 Evaluation of text classification