WEB BAR 2004: Advanced Retrieval and Web Mining
Lecture 16
Today's Topics
- Evaluation of classification
- Vector space methods, nearest neighbors
- Decision trees
Measuring Classification: Figures of Merit
- Accuracy of classification: the main evaluation criterion in academia
- Speed of training a statistical classifier
- Speed of classification (docs/hour): no big differences for most algorithms; exception: kNN
- Effort in creating the training set (human hours/class): active learning can speed up training set creation
Measures of Accuracy
- Error rate: not a good measure for small classes. Why?
- Precision/recall for classification decisions
- F1 measure: 1/F1 = ½ (1/P + 1/R)
- Precision/recall for ranking classes (next slide)
- Correct estimate of the size of a category. Why is this different?
- Stability over time / category drift
- Utility: costs of false positives / false negatives may be different; for example, cost = tp - 0.5 fp
Accuracy Measurement: 1 cat/doc
Confusion matrix: entry (i, j) counts the test docs that are actually in class i and were assigned to class j by the classifier. (In the slide's example matrix, the highlighted entry 53 means that 53 of the docs actually in class i were put in class j.)
Confusion Matrix
- A function of the classifier, the classes, and the test docs.
- For a perfect classifier, all off-diagonal entries should be zero.
- For a perfect classifier, if there are n docs in category j, then entry (j, j) should be n.
- Straightforward when there is 1 category per document.
- Can be extended to n categories per document.
Confusion Measures (1 class / doc)
- Recall: fraction of docs in class i classified correctly: Recall(i) = c(i,i) / Σ_j c(i,j)
- Precision: fraction of docs assigned class i that are actually about class i: Precision(i) = c(i,i) / Σ_j c(j,i)
- "Correct rate" (= 1 - error rate): fraction of docs classified correctly: Σ_i c(i,i) / Σ_i Σ_j c(i,j)
(A code sketch of these measures follows below.)
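A minimal sketch of these measures in code (NumPy-based; the function name and the toy matrix are illustrative, not from the lecture):

```python
import numpy as np

def confusion_measures(C):
    """Per-class precision/recall and overall correct rate from a
    confusion matrix C, where C[i, j] counts docs actually in class i
    that the classifier assigned to class j."""
    C = np.asarray(C, dtype=float)
    recall = np.diag(C) / C.sum(axis=1)      # row sums: true class sizes
    precision = np.diag(C) / C.sum(axis=0)   # column sums: assigned class sizes
    correct_rate = np.trace(C) / C.sum()     # = 1 - error rate
    return precision, recall, correct_rate

# Toy 2-class example (rows = actual class, columns = assigned class)
C = [[53, 7],
     [10, 30]]
p, r, acc = confusion_measures(C)
print(p, r, acc)
```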
Multiclass Problem: >1 cat/doc
Precision/Recall for Ranking Classes
- Example doc: "Bad wheat harvest in Turkey"
- True categories: wheat, turkey
- Ranked category list: 0.9 turkey, 0.7 poultry, 0.5 armenia, 0.4 barley, 0.3 georgia
- Precision at 5: 0.2 (1 of the 5 ranked categories is correct); Recall at 5: 0.5 (1 of the 2 true categories is retrieved)
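A small sketch of precision/recall at rank k, reproducing the slide's example (the function name is illustrative):

```python
def precision_recall_at_k(ranked, true_cats, k):
    """Precision and recall at rank k for a ranked list of candidate
    categories, against the document's set of true categories."""
    top_k = ranked[:k]
    hits = len(set(top_k) & set(true_cats))
    return hits / k, hits / len(true_cats)

# The slide's example: "Bad wheat harvest in Turkey"
ranked = ["turkey", "poultry", "armenia", "barley", "georgia"]
p, r = precision_recall_at_k(ranked, {"wheat", "turkey"}, k=5)
print(p, r)  # 0.2 0.5
```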
Precision/Recall for Ranking Classes
- Consider problems with many categories (>10)
- Use a method returning scores comparable across categories (not: Naïve Bayes)
- Rank categories and compute average precision (or another measure characterizing the precision/recall curve)
- Good measure for interactive support of human categorization
- Useless for an "autonomous" system (e.g., a filter on a stream of newswire stories)
Multiclass Problem: >1 cat/doc
Micro- vs. Macro-Averaging
- If we have more than one class, how do we combine multiple performance measures into one quantity?
- Macroaveraging: compute performance for each class, then average.
- Microaveraging: collect decisions for all classes, compute one contingency table, evaluate.
Micro- vs. Macro-Averaging: Example

Class 1:
                 Truth: yes   Truth: no
Classifier: yes      10           10
Classifier: no       10          970

Class 2:
                 Truth: yes   Truth: no
Classifier: yes      90           10
Classifier: no       10          890

Pooled (microaverage) table:
                 Truth: yes   Truth: no
Classifier: yes     100           20
Classifier: no       20         1860

Macroaveraged precision: (10/20 + 90/100) / 2 = (0.5 + 0.9) / 2 = 0.7
Microaveraged precision: 100/120 = .83
Why this difference?
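The same numbers in code, as a sketch (the (tp, fp, fn, tn) tuple layout is an illustrative assumption):

```python
def precision(tp, fp):
    return tp / (tp + fp)

# (tp, fp, fn, tn) per class, read off the slide's contingency tables
class1 = (10, 10, 10, 970)
class2 = (90, 10, 10, 890)

# Macroaverage: compute precision per class, then average
macro = (precision(*class1[:2]) + precision(*class2[:2])) / 2

# Microaverage: pool the decision counts first, then compute precision
tp = class1[0] + class2[0]
fp = class1[1] + class2[1]
micro = precision(tp, fp)

print(macro, micro)  # 0.7  0.8333...
```

The microaverage is pulled toward the large class (Class 2), which is exactly the "why this difference?" the slide asks about.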
Variability of Measurements
- Can we trust a computed figure of merit?
- Variability of the data: document size/length, quality/style of authorship, uniformity of vocabulary
- Variability of the "truth" / gold standard: we need a definitive judgment on which class(es) a doc belongs to, usually from humans; ideally, judgments are consistent
Vector Space Classification
Nearest Neighbor Classification
Classification Methods
- Naïve Bayes
- Vector space methods, nearest neighbors
- Decision trees
- Logistic regression
- Support vector machines
Recall: Vector Space Representation
- Each document is a vector, with one component for each term (= word).
- Normalize vectors to unit length.
- Terms are axes: 10,000+ dimensions, or even 1,000,000+
- Docs are vectors in this space
Classification Using Vector Spaces
- Each training doc is a point (vector), labeled by its topic (= class)
- Hypothesis: docs of the same class form a contiguous region of the space
- Define surfaces to delineate classes in the space
Classes in a Vector Space
[Figure: documents from the classes Government, Science, and Arts plotted as points in the vector space]
Test Document = Government
[Figure: a test document falling within the Government region, among the Government, Science, and Arts classes]
Similarity hypothesis true in general?
Binary Classification
- Consider 2-class problems
- How do we define (and find) the separating surface?
- How do we test which region a test doc is in?
Separation by Hyperplanes
- Assume linear separability for now: in 2 dimensions, we can separate by a line; in higher dimensions, we need hyperplanes
- A separating hyperplane can be found by linear programming, or iteratively (e.g., with the perceptron algorithm); in 2D the separator can be expressed as ax + by = c
Linear Programming / Perceptron
Find a, b, c, such that
ax + by ≥ c for red points,
ax + by < c for green points.
(A perceptron sketch follows below.)
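A minimal perceptron sketch for this setup (NumPy; the toy data, function name, and learning-rate default are illustrative assumptions, not from the lecture):

```python
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """Find a separating hyperplane w.x = c for linearly separable data.
    X: (n, d) points; y: labels in {+1, -1} (+1 ~ 'red', -1 ~ 'green')."""
    n, d = X.shape
    w = np.zeros(d)
    c = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w - c) <= 0:   # point on the wrong side: update
                w += lr * yi * xi
                c -= lr * yi
                errors += 1
        if errors == 0:                  # all points separated: stop
            break
    return w, c                          # in 2D: w = (a, b), giving ax + by = c

# Tiny 2D example (illustrative data)
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, c = perceptron(X, y)
```

For linearly separable data this loop is guaranteed to converge; for non-separable data it cycles forever, which is one motivation for the optimal-hyperplane methods discussed on the next slides.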
Which Hyperplane?
In general, lots of possible solutions for a, b, c.
Which Hyperplane?
- Lots of possible solutions for a, b, c.
- Some methods find a separating hyperplane, but not the optimal one (e.g., perceptron)
- Most methods find an optimal separating hyperplane
- Which points should influence optimality?
  - All points: linear regression, Naïve Bayes
  - Only "difficult" points close to the decision boundary: support vector machines, logistic regression (kind of)
Hyperplane: Example
Class: "interest" (as in interest rate)
Example features of a linear classifier (SVM); weight w_i for term t_i:

   w_i   t_i             w_i    t_i
  0.70   prime          -0.71   dlrs
  0.67   rate           -0.35   world
  0.63   interest       -0.33   sees
  0.60   rates          -0.25   year
  0.46   discount       -0.24   group
  0.43   bundesbank     -0.24   dlr

(A scoring sketch with these weights follows below.)
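A sketch of how such a linear classifier scores a document with these weights (binary term features; the threshold of 0 is an illustrative assumption):

```python
# Weights from the slide's "interest" classifier
weights = {
    "prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
    "discount": 0.46, "bundesbank": 0.43,
    "dlrs": -0.71, "world": -0.35, "sees": -0.33, "year": -0.25,
    "group": -0.24, "dlr": -0.24,
}

def score(doc_terms, threshold=0.0):
    """Sum the weights of the terms present in the doc; assign the
    class if the score clears the threshold."""
    s = sum(weights.get(t, 0.0) for t in doc_terms)
    return s, s > threshold

doc = "the bundesbank left the discount rate unchanged".split()
print(score(doc))  # positive score -> assigned to class "interest"
```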
Relationship to Naïve Bayes?
Find a, b, c, such that
ax + by ≥ c for red points,
ax + by < c for green points.
Linear Classifiers
- Many common text classifiers are linear classifiers: Naïve Bayes, perceptron, Rocchio, logistic regression, support vector machines (with linear kernel), linear regression, (simple) neural networks
- Despite this similarity, there are large performance differences
- For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose? What to do for non-separable problems?
- Different training methods pick different hyperplanes
- Classifiers more powerful than linear often don't perform better. Why?
More Than Two Classes
- Any-of (multiclass) classification: classes are independent of each other; a document can belong to 0, 1, or >1 classes; decompose into n binary problems
- One-of (multinomial, polytomous) classification: classes are mutually exclusive; each document belongs to exactly one class
One-of / Polytomous / Multinomial Classification
- Digit recognition is polytomous classification
- Digits are mutuallyexclusive: a given image is to be assigned to exactly one of the 10 classes
Composing Surfaces: Issues
[Figure: composed binary separators leave ambiguous regions, each marked "?": regions claimed by more than one class, or by none]
Set of Binary Classifiers: Any-of
- Build a separator between each class and its complementary set (docs from all other classes).
- Given a test doc, evaluate it for membership in each class.
- Apply the decision criterion of each classifier independently. Done.
Set of Binary Classifiers: One-of
- Build a separator between each class and its complementary set (docs from all other classes).
- Given a test doc, evaluate it for membership in each class.
- Assign the document to the class with maximum score, maximum confidence, or maximum probability (see the sketch below)
- Why is this different from multiclass/any-of classification? What about Naïve Bayes?
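A sketch contrasting the two decision rules (the classifier representation, a map from class name to a (weights, bias) pair, is an illustrative assumption):

```python
def _score(doc_terms, weights, bias):
    return sum(weights.get(t, 0.0) for t in doc_terms) + bias

def one_of_classify(doc_terms, classifiers):
    """One-of: score the doc against every class, take the argmax."""
    return max(classifiers, key=lambda c: _score(doc_terms, *classifiers[c]))

def any_of_classify(doc_terms, classifiers):
    """Any-of: apply each binary classifier's threshold independently."""
    return [c for c in classifiers if _score(doc_terms, *classifiers[c]) > 0]

# Toy classifiers (illustrative weights)
clf = {
    "interest": ({"rate": 0.7, "dlrs": -0.7}, -0.1),
    "grain": ({"wheat": 0.9, "rate": -0.2}, -0.1),
}
print(one_of_classify(["rate", "cut"], clf))  # -> "interest"
print(any_of_classify(["rate", "cut"], clf))  # -> ["interest"]
```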
Negative Examples
Formulate as above, except that negative examples for a class are added to its complementary set.
[Figure: positive examples of the class vs. negative examples pooled into the complementary set]
Centroid Classification
- Given the training docs for a class, compute their centroid
- Now we have a centroid for each class
- Given a query doc, assign it to the class whose centroid is nearest (see the sketch below)
- Exercise: compare to Rocchio
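A minimal centroid-classifier sketch (NumPy; unit-length document vectors are assumed, as on the earlier representation slide, and dot-product/cosine similarity is used as the notion of "nearest"):

```python
import numpy as np

def train_centroids(X, y):
    """Centroid for each class, from doc vectors X (n, d) and labels y."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(doc, centroids):
    """Assign doc to the class whose centroid is nearest (here: highest
    dot-product similarity; Euclidean distance is the other common choice)."""
    return max(centroids, key=lambda c: doc @ centroids[c])

# Toy example: two tiny classes in 2D (illustrative data)
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array(["gov", "arts", "arts", "arts"][:0] + ["gov", "gov", "arts", "arts"])
print(classify(np.array([0.8, 0.2]), train_centroids(X, y)))  # -> "gov"
```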
Example
[Figure: centroids of the Government, Science, and Arts classes, with boundaries between the regions nearest to each centroid]
k Nearest Neighbor Classification
To classify document d into class c (sketch below):
- Define the k-neighborhood N as the k nearest neighbors of d
- Count the number i of documents in N that belong to c
- Estimate P(c|d) as i/k
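A direct transcription of this rule as a sketch (NumPy; cosine similarity on unit vectors is an assumption carried over from the vector-space slides):

```python
import numpy as np

def knn_class_prob(d, X, y, c, k=6):
    """Estimate P(c|d) = i/k, where i counts how many of the k nearest
    training docs (by cosine similarity on unit vectors) belong to class c."""
    sims = X @ d                        # similarity of d to every training doc
    neighborhood = np.argsort(-sims)[:k]  # indices of the k nearest neighbors
    i = int(np.sum(y[neighborhood] == c))
    return i / k
```

Note that the whole training set is scanned at classification time, which is the cost discussed two slides below.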
Example: k = 6 (6NN)
[Figure: a test document and its 6 nearest neighbors, drawn from the Government, Science, and Arts classes]
P(science | test doc)?
kNN Is Close to Optimal
- Cover and Hart 1967: asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate (the exact bound is given below).
- In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
- Intuition: asymptotically, the query point coincides with a training point; both the query point's label and the training point's label contribute error -> 2 times the Bayes rate.
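For reference, the full Cover and Hart (1967) bound for M classes, with R* the Bayes rate and R the asymptotic 1NN error rate, can be written as:

```latex
R^{*} \;\le\; R \;\le\; R^{*}\left(2 - \frac{M}{M-1}\,R^{*}\right) \;\le\; 2R^{*}
```

Both inequalities are tight at R* = 0, which gives the slide's special case: zero Bayes rate implies zero asymptotic 1NN error.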
kNN: Discussion
- Classification time is linear in the size of the training set: expensive
- No feature selection necessary
- Scales well with a large number of classes: don't need to train n classifiers for n classes
- Classes can influence each other: small changes to one class can have a ripple effect
- Scores can be hard to convert to probabilities
- No training necessary. Actually: not true. Why?
Number of neighbors
kNN vs. Naïve Bayes
- Bias/variance tradeoff; variance ≈ capacity
- kNN has high variance and low bias: infinite memory
- NB has low variance and high bias: the decision surface has to be linear (a hyperplane)
- Consider: is an object a tree? (Burges)
  - Too much capacity/variance, low bias: the botanist who memorizes will always say "no" to a new object (e.g., a different #leaves)
  - Not enough capacity/variance, high bias: the lazy botanist says "yes" if the object is green
References
- David D. Lewis. 1995. Evaluating and Optimizing Autonomous Text Classification Systems. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
- Christopher Manning and Hinrich Schuetze. Foundations of Statistical Natural Language Processing. Chapter 16. MIT Press.
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
Decision Tree Example
Decision Trees
- Tree with internal nodes labeled by terms
- Branches are labeled by tests on the weight that the term has
- Leaves are labeled by categories
- The classifier categorizes a document by descending the tree, following the tests, to a leaf
- The label of the leaf node is then assigned to the document
- Most decision trees are binary trees (never disadvantageous; may require extra internal nodes)
Decision Trees vs. Hyperplanes
- Decision trees: optimal use of a few high-leverage features
- Hyperplanes integrate many features, e.g. the "interest" weights: 0.70 prime, 0.67 rate, 0.63 interest, 0.60 rates, 0.46 discount, 0.43 bundesbank, -0.71 dlrs, -0.35 world, -0.33 sees, -0.25 year, -0.24 group, -0.24 dlr
Decision Tree Learning
- Learn a sequence of tests on features, typically using top-down, greedy search
- At each stage choose the unused feature with the highest information gain (a sketch follows below)
- Binary (yes/no) or continuous decisions
[Figure: a small example tree that first tests f1 (branches f1 / !f1), then f7 (branches f7 / !f7), with leaf estimates P(class) = .6, P(class) = .9, P(class) = .2]
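A sketch of the information-gain computation used to pick the next feature (two-class case; the example counts are made up, not from the lecture):

```python
import math

def entropy(pos, neg):
    """Entropy (in bits) of a two-class doc set with pos/neg members."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(pos, neg, splits):
    """Entropy reduction from splitting a (pos, neg) node into the given
    [(pos_i, neg_i), ...] children (e.g., feature present vs. absent)."""
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - remainder

# A node with 50 positive / 50 negative docs, split on term occurrence
print(information_gain(50, 50, [(40, 10), (10, 40)]))  # ~0.278 bits
```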
Decision Tree Learning
- Fully grown trees tend to have decision rules that are overly specific and are therefore unable to categorize documents well
- Therefore, pruning or early-stopping methods for decision trees are normally a standard part of classification packages
- Use of a small number of features is potentially bad in text categorization, but in practice decision trees do well for some text classification tasks
- Decision trees are very easily interpreted by humans, much more easily than probabilistic methods like Naïve Bayes
- Decision trees are normally regarded as a symbolic machine learning algorithm, though they can be used probabilistically
Classic Reuters Data Set
- 21578 documents; 9603 training, 3299 test articles (ModApte split)
- 118 categories; an article can be in more than one category
- Learn 118 binary category distinctions
- Common categories (#train, #test): earn (2877, 1087), acquisitions (1650, 719), money-fx (538, 179), grain (433, 149), crude (389, 189), trade (369, 119), interest (347, 131), ship (197, 89), wheat (212, 71), corn (182, 56)

Example "interest rate" article:
  2-APR-1987 06:35:19.50 west-germany b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  FRANKFURT, March 2: The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
Classic Reuters Data Set
- Average document: about 90 types, 200 tokens
- Average number of classes assigned: 1.24 for docs with at least one category
- Only about 10 out of the 118 categories are large
- Microaveraging therefore mostly measures performance on the large categories.
Dumais et al. 1998: Reuters Accuracy
- Recall: % labeled in category among those stories that are really in category
- Precision: % really in category among those stories labeled in category
- Break-even: (Recall + Precision) / 2
Category: "interest"
[Figure: decision tree for the category "interest", testing features such as rate=1 / rate=0, rate.t=1, lending=0, prime=0, discount=0, pct=1, and year=1 / year=0]
Reuters ROC: Category Grain
[Figure: precision (y-axis) vs. recall (x-axis) curves for LSVM, Decision Tree, Naïve Bayes, and Find Similar]
- Recall: % labeled in category among those stories that are really in category
- Precision: % really in category among those stories labeled in category
- Two methods with the same curve: same performance?
ROC for Category: Earn
[Figure: precision vs. recall curves for LSVM, Decision Tree, Naïve Bayes, and Find Similar]
ROC for Category: Crude
[Figure: precision vs. recall curves for LSVM, Decision Tree, Naïve Bayes, and Find Similar]
ROC for Category: Ship
[Figure: precision vs. recall curves for LSVM, Decision Tree, Naïve Bayes, and Find Similar]
Take Away
- Evaluation of text classification
- kNN
- Decision trees
- Hyperplanes
- Many popular classification methods are linear classifiers: the training method is the key difference, not the "hypothesis space" of potential classifiers
- Any-of/multiclass vs. one-of/polytomous classification