Processing of large document collections Part 3 (Evaluation of text classifiers, term selection) Helena Ahonen-Myka Spring 2006.

2 Evaluation of text classifiers
evaluation of document classifiers is typically conducted experimentally, rather than analytically
reason: in order to evaluate a system analytically, we would need a formal specification of the problem that the system is trying to solve
– text categorization is non-formalisable

3 Evaluation
the experimental evaluation of a classifier usually measures its effectiveness (rather than its efficiency)
– effectiveness = ability to take the right classification decisions
– efficiency = time and space requirements

4 Evaluation
after a classifier is constructed using a training set, its effectiveness is evaluated using a test set
the following counts are computed for each category i:
– TP_i: true positives
– FP_i: false positives
– TN_i: true negatives
– FN_i: false negatives

5 Evaluation
TP_i: true positives w.r.t. category c_i
– the set of documents that both the classifier and the previous judgments (as recorded in the test set) classify under c_i
FP_i: false positives w.r.t. category c_i
– the set of documents that the classifier classifies under c_i, but the test set indicates that they do not belong to c_i

6 Evaluation
TN_i: true negatives w.r.t. c_i
– both the classifier and the test set agree that the documents in TN_i do not belong to c_i
FN_i: false negatives w.r.t. c_i
– the classifier does not classify the documents in FN_i under c_i, but the test set indicates that they should be classified under c_i

7 Evaluation measures
precision wrt c_i = TP_i / (TP_i + FP_i)
recall wrt c_i = TP_i / (TP_i + FN_i)
contingency table for category c_i:
                            in test class c_i    not in test class c_i
  classified under c_i            TP_i                   FP_i
  not classified under c_i        FN_i                   TN_i
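A minimal sketch (not from the original slides) of computing these two measures from the per-category counts; the category names and count values below are invented for illustration.

```python
# A minimal sketch: per-category precision and recall from the counts above.
# The category names and count values are invented for illustration.

def precision(tp, fp):
    # Precision_i = TP_i / (TP_i + FP_i); 0 if nothing was classified under c_i.
    return tp / (tp + fp) if tp + fp > 0 else 0.0

def recall(tp, fn):
    # Recall_i = TP_i / (TP_i + FN_i); 0 if the test set has no positives for c_i.
    return tp / (tp + fn) if tp + fn > 0 else 0.0

# Hypothetical counts per category: (TP_i, FP_i, FN_i); TN_i is not needed here.
counts = {"wheat": (20, 5, 10), "corn": (2, 1, 7)}

for cat, (tp, fp, fn) in counts.items():
    print(cat, round(precision(tp, fp), 2), round(recall(tp, fn), 2))
```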

8 Evaluation measures
for obtaining estimates for precision and recall over the collection as a whole (= all categories), two different methods may be adopted:
– microaveraging: the counts of true positives, false positives and false negatives for all categories are first summed up; precision and recall are then calculated from the global values
– macroaveraging: precision (recall) is the average of the precision (recall) values of the individual categories
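A short sketch (again not from the slides) contrasting the two averaging methods for precision; the per-category counts are made up, with one large and one small category so that the two averages differ.

```python
# A sketch contrasting micro- and macroaveraged precision (recall is analogous).
# The per-category counts are invented: one large and one small category.

counts = {
    "wheat": {"tp": 20, "fp": 5},   # large category
    "corn":  {"tp": 2,  "fp": 1},   # small category
}

# Microaveraging: sum the counts over all categories, then compute precision once.
tp = sum(c["tp"] for c in counts.values())
fp = sum(c["fp"] for c in counts.values())
micro = tp / (tp + fp)

# Macroaveraging: compute precision per category, then take the unweighted mean.
per_category = [c["tp"] / (c["tp"] + c["fp"]) for c in counts.values()]
macro = sum(per_category) / len(per_category)

print(round(micro, 2))   # 0.79: dominated by the large category
print(round(macro, 2))   # 0.73: the small category weighs as much as the large one
```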

9 Evaluation measures
microaveraging and macroaveraging may give quite different results if the categories differ greatly in size
– e.g. the ability of a classifier to behave well also on small categories (i.e. categories with few positive training instances) will be emphasized by macroaveraging
the choice depends on the application

10 Combined effectiveness measures
neither precision nor recall makes sense in isolation from the other
the trivial acceptor (each document is classified under each category) has recall = 1
– in this case, precision would usually be very low
higher levels of precision may be obtained at the price of lower values of recall

11 Trivial acceptor
all documents are classified under c_i, so FN_i = 0 and TN_i = 0
– recall wrt c_i = TP_i / (TP_i + FN_i) = 1
– precision wrt c_i = TP_i / (TP_i + FP_i), which is usually very low, since FP_i now contains all the documents that do not belong to c_i

12 Combined effectiveness measures
a classifier should be evaluated by means of a measure which combines recall and precision
some combined measures:
– 11-point average precision
– the breakeven point
– the F1 measure

13 11-point average precision
in constructing the classifier, the threshold is repeatedly tuned so as to allow recall (for the category) to take the values 0.0, 0.1, …, 0.9, 1.0
precision (for the category) is computed at these 11 recall levels and averaged over the 11 resulting values
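A rough sketch of how 11-point average precision could be computed for a single category, assuming the classifier outputs a score per document; the scores and gold labels below are invented, and precision at each recall level is interpolated as the best precision reachable at that recall level or higher, which corresponds to tuning the threshold.

```python
# A rough sketch of 11-point average precision for a single category, assuming the
# classifier outputs a score per document; scores and gold labels are invented.

def eleven_point_avg_precision(scores, labels):
    # Rank documents by decreasing score and record (recall, precision)
    # after each document in the ranking.
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    points, tp = [], 0
    for i, (_, is_pos) in enumerate(ranked, start=1):
        tp += is_pos
        points.append((tp / total_pos, tp / i))  # (recall, precision)
    # Interpolated precision at recall levels 0.0, 0.1, ..., 1.0, then averaged.
    levels = [i / 10 for i in range(11)]
    interpolated = [max((p for r, p in points if r >= level), default=0.0)
                    for level in levels]
    return sum(interpolated) / len(interpolated)

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]   # hypothetical classifier scores
labels = [1,   0,   1,   0,   0,   1]     # 1 = document belongs to the category
print(round(eleven_point_avg_precision(scores, labels), 2))   # 0.73
```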

15 Recall-precision curve
[figure: a recall-precision curve, with recall (0-100%) on one axis plotted against precision (0-100%) on the other]

16 Breakeven point
the process is analogous to the one used for 11-point average precision
– precision as a function of recall is computed by repeatedly varying the thresholds
the breakeven point is the value at which precision equals recall
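A sketch under the same assumptions as above (one score per document, invented data): walking down the ranking amounts to lowering the threshold step by step, and the point where precision and recall are closest is reported as an interpolated breakeven value.

```python
# A sketch of finding an (interpolated) breakeven point for one category.

def breakeven(scores, labels):
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    best, tp = None, 0
    for i, (_, is_pos) in enumerate(ranked, start=1):
        tp += is_pos
        p, r = tp / i, tp / total_pos   # precision and recall at this threshold
        if best is None or abs(p - r) < abs(best[0] - best[1]):
            best = (p, r)
    return (best[0] + best[1]) / 2      # midpoint of the closest pair

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1,   0,   1,   0,   0,   1]
print(round(breakeven(scores, labels), 2))   # 0.67
```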

17 F1 measure
the F1 measure is defined as:
– F1 = 2 · precision · recall / (precision + recall)
for the trivial acceptor, precision ≈ 0 and recall = 1, so F1 ≈ 0
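A tiny sketch of the measure, including the trivial-acceptor case; the example values are invented.

```python
# A tiny sketch of the F1 measure, including the trivial-acceptor case.

def f1(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.8, 0.6), 2))    # 0.69: a reasonably balanced classifier
print(round(f1(0.01, 1.0), 2))   # 0.02: trivial acceptor, recall 1 but precision near 0
```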

18 Effectiveness once an effectiveness measure is chosen, a classifier can be tuned (e.g. thresholds and other parameters can be set) so that the resulting effectiveness is the best achievable by that classifier

19 Evaluation measures
efficiency (= time and space requirements)
– seldom used, although important for real-life applications
– difficult to compare systems: environment parameters change
– two parts:
  – training efficiency = average time it takes to build a classifier for a category from a training set
  – classification efficiency = average time it takes to classify a new document under a category

20 Conducting experiments
in general, different sets of experiments may be used for cross-classifier comparison only if the experiments have been performed
– on exactly the same collection (same documents and same categories)
– with the same split between training set and test set
– with the same evaluation measure

21 Term selection
a large document collection may contain millions of words -> document vectors would contain millions of dimensions
– many algorithms cannot handle such a high dimensionality of the term space (= large number of terms)
– very specific terms may lead to overfitting: the classifier classifies the documents in the training data well but often fails with unseen documents

22 Term selection
usually only a subset of the terms is used
how to select the terms that are used?
– term selection (often called feature selection or dimensionality reduction) methods

23 Term selection
goal: select the terms that yield the highest effectiveness in the given application
wrapper approach
– a candidate set of terms is found and tested with the application
– iteration: based on the test results, the set of terms is modified and tested again, until the set is optimal
filtering approach
– keep the terms that receive the highest score according to a function that measures the "importance" of the term for the task

24 Term selection
many functions are available
– document frequency: keep the high-frequency terms
  – stopwords have already been removed
  – 50% of the words occur only once in the document collection
  – e.g. remove all terms occurring in at most 3 documents

25 Term selection functions: document frequency
document frequency is the number of documents in which a term occurs
in our sample, the ranking of terms:
– 9 current
– 7 project
– 4 environment
– 3 nuclear
– 2 application
– 2 area
– …
– 2 water
– 1 use
– …

26 Term selection functions: document frequency
we might now set the threshold to 2 and remove all the words that occur only once
result: 29 of the 118 words (~25%) are selected
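A small sketch of term selection by document frequency; the toy documents and the threshold value are invented, and stopwords are assumed to be already removed.

```python
# A sketch of term selection by document frequency: count in how many documents
# each term occurs and keep only the terms above a threshold.

from collections import Counter

docs = [
    "nuclear project area",
    "current project environment",
    "water application current environment",
    "current nuclear application",
]

df = Counter()
for doc in docs:
    df.update(set(doc.split()))        # each term counts at most once per document

threshold = 2                          # keep terms occurring in at least 2 documents
selected = sorted(t for t, n in df.items() if n >= threshold)
print(selected)   # ['application', 'current', 'environment', 'nuclear', 'project']
```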

27 Term selection: other functions
information-theoretic term selection functions, e.g.
– chi-square
– information gain
– mutual information
– odds ratio
– relevancy score

28 Term selection: information gain (IG)
information gain measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document
information gain is calculated for each term, and the n terms with the highest values are selected

29 Term selection: IG
information gain for term t:
– G(t) = - sum_{i=1..m} P(c_i) log P(c_i)
         + P(t) sum_{i=1..m} P(c_i|t) log P(c_i|t)
         + P(~t) sum_{i=1..m} P(c_i|~t) log P(c_i|~t)
– m: the number of categories

30 Term selection: estimating probabilities
Doc 1: cat cat cat (c)
Doc 2: cat cat cat dog (c)
Doc 3: cat dog mouse (~c)
Doc 4: cat cat cat dog dog dog (~c)
Doc 5: mouse (~c)
2 classes: c and ~c

31 Term selection: estimating probabilities
P(t): probability of a term t
– P(cat) = 4/5, i.e. 'cat' occurs in 4 of the 5 documents, or
– P(cat) = 10/17, the proportion of the occurrences of 'cat' among all term occurrences

32 Term selection: estimating probabilities
P(~t): probability of the absence of t
– P(~cat) = 1/5, or
– P(~cat) = 7/17

33 Term selection: estimating probabilities
P(c_i): probability of category i
– P(c) = 2/5 (the proportion of documents belonging to c in the collection), or
– P(c) = 7/17 (7 of the 17 term occurrences are in documents belonging to c)

34 Term selection: estimating probabilities
P(c_i | t): probability of category i given that t is in the document, i.e., the proportion of the documents containing t that belong to category i
– P(c | cat) = 2/4 (or 6/10)
– P(~c | cat) = 2/4 (or 4/10)
– P(c | mouse) = 0
– P(~c | mouse) = 1

35 Term selection: estimating probabilities
P(c_i | ~t): probability of category i given that t is not in the document, i.e., the proportion of the documents not containing t that belong to category i
– P(c | ~cat) = 0 (or 1/7)
– P(c | ~dog) = 1/2 (or 6/12)
– P(c | ~mouse) = 2/3 (or 7/15)

36 Term selection: estimating probabilities
in other words, assume
– term t occurs in B documents, A of which are in category c
– category c has D documents, out of the N documents in the collection

37 Term selection: estimating probabilities
[diagram: of the N documents in the collection, D belong to category c and B contain term t; A documents both belong to c and contain t]

38 Term selection: estimating probabilities
for instance,
– P(t) = B/N
– P(~t) = (N-B)/N
– P(c) = D/N
– P(c|t) = A/B
– P(c|~t) = (D-A)/(N-B)
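As a small illustrative sketch (not from the slides), these estimates can be applied to the document-based counts of the 'cat' example; the values of N, D, B and A below come from the five toy documents.

```python
# A tiny sketch plugging the document-based counts of the 'cat' example into
# the estimates above (N, D, B, A as defined on the previous slides).

N = 5   # documents in the collection
D = 2   # documents belonging to category c
B = 4   # documents containing the term t = 'cat'
A = 2   # documents containing 'cat' that belong to c

p_t           = B / N              # P(t)    = 4/5
p_not_t       = (N - B) / N        # P(~t)   = 1/5
p_c           = D / N              # P(c)    = 2/5
p_c_given_t   = A / B              # P(c|t)  = 2/4
p_c_given_not = (D - A) / (N - B)  # P(c|~t) = 0

print(p_t, p_not_t, p_c, p_c_given_t, p_c_given_not)
```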

39 Term selection: IG
information gain for term t:
– G(t) = - sum_{i=1..m} P(c_i) log P(c_i)
         + P(t) sum_{i=1..m} P(c_i|t) log P(c_i|t)
         + P(~t) sum_{i=1..m} P(c_i|~t) log P(c_i|~t)
– m: the number of categories

40
p(c) = 2/5, p(~c) = 3/5
p(cat) = 4/5, p(~cat) = 1/5
p(dog) = 3/5, p(~dog) = 2/5
p(mouse) = 2/5, p(~mouse) = 3/5
p(c|cat) = 2/4, p(~c|cat) = 2/4, p(c|~cat) = 0, p(~c|~cat) = 1
p(c|dog) = 1/3, p(~c|dog) = 2/3, p(c|~dog) = 1/2, p(~c|~dog) = 1/2
p(c|mouse) = 0, p(~c|mouse) = 1, p(c|~mouse) = 2/3, p(~c|~mouse) = 1/3

-(p(c) log p(c) + p(~c) log p(~c))
  = -(2/5 log 2/5 + 3/5 log 3/5)
  = -(2/5 (log 2 - log 5) + 3/5 (log 3 - log 5))
  = -(2/5 (1 - log 5) + 3/5 (log 3 - log 5))
  = -(2/5 + 3/5 log 3 - log 5)
  = -(1.35 - 2.32) = 0.97        (log base = 2)

p(cat) (p(c|cat) log p(c|cat) + p(~c|cat) log p(~c|cat))
  = 4/5 (1/2 log 1/2 + 1/2 log 1/2) = 4/5 log 1/2 = 4/5 (log 1 - log 2) = 4/5 (0 - 1) = -0.8

p(~cat) (p(c|~cat) log p(c|~cat) + p(~c|~cat) log p(~c|~cat))
  = 1/5 (0 + 1 log 1) = 0

G(cat) = 0.97 + (-0.8) + 0 = 0.17

41
p(dog) (p(c|dog) log p(c|dog) + p(~c|dog) log p(~c|dog))
  = 3/5 (1/3 log 1/3 + 2/3 log 2/3)
  = 3/5 (1/3 (log 1 - log 3) + 2/3 (log 2 - log 3))
  = 3/5 (-1/3 log 3 - 2/3 log 3 + 2/3)
  = 3/5 (-log 3 + 2/3) = 0.6 · (-0.92) = -0.55

p(~dog) (p(c|~dog) log p(c|~dog) + p(~c|~dog) log p(~c|~dog))
  = 2/5 (1/2 log 1/2 + 1/2 log 1/2) = 2/5 (log 1 - log 2) = -0.4

G(dog) = 0.97 + (-0.55) + (-0.4) = 0.02

p(mouse) (p(c|mouse) log p(c|mouse) + p(~c|mouse) log p(~c|mouse))
  = 2/5 (0 + 1 log 1) = 0

p(~mouse) (p(c|~mouse) log p(c|~mouse) + p(~c|~mouse) log p(~c|~mouse))
  = 3/5 (2/3 log 2/3 + 1/3 log 1/3) = 3/5 · (-0.92) = -0.55

G(mouse) = 0.97 + 0 + (-0.55) = 0.42

ranking: 1. mouse 2. cat 3. dog
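To tie the example together, here is a small Python sketch (not part of the original slides) that recomputes G(cat), G(dog) and G(mouse) from the five toy documents, using the document-based probability estimates and treating 0 · log 0 as 0.

```python
# A sketch that recomputes the information gain values of the worked example.

from math import log2

docs = [
    ({"cat"},                 "c"),    # Doc 1: cat cat cat
    ({"cat", "dog"},          "c"),    # Doc 2: cat cat cat dog
    ({"cat", "dog", "mouse"}, "~c"),   # Doc 3: cat dog mouse
    ({"cat", "dog"},          "~c"),   # Doc 4: cat cat cat dog dog dog
    ({"mouse"},               "~c"),   # Doc 5: mouse
]
categories = ["c", "~c"]

def plogp(p):
    return p * log2(p) if p > 0 else 0.0

def info_gain(term):
    n = len(docs)
    with_t = [c for terms, c in docs if term in terms]        # classes of docs containing t
    without_t = [c for terms, c in docs if term not in terms] # classes of docs without t
    p_t, p_not_t = len(with_t) / n, len(without_t) / n

    entropy = -sum(plogp(sum(c == ci for _, c in docs) / n) for ci in categories)
    cond_t = sum(plogp(with_t.count(ci) / len(with_t)) for ci in categories) if with_t else 0.0
    cond_not_t = (sum(plogp(without_t.count(ci) / len(without_t)) for ci in categories)
                  if without_t else 0.0)
    return entropy + p_t * cond_t + p_not_t * cond_not_t

for term in ["cat", "dog", "mouse"]:
    print(term, round(info_gain(term), 2))   # cat 0.17, dog 0.02, mouse 0.42
```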

42 Example, some intuitive remarks
'mouse' is the best term, since it occurs only in ~c documents
'cat' is good in one respect: if it does not occur, the category is always ~c
'cat' is not good in another respect: half of the documents in which 'cat' occurs are in c, half are in ~c
'dog' is the worst, since if it occurs, the category can be either c or ~c, and if it does not occur, the category can also be either c or ~c