
1 Processing of large document collections Part 2

2 Feature selection: IG
- Information gain: measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document
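As a point of reference, one common entropy-style formulation of this quantity, summing over term presence/absence and category membership, is given below; the slide itself does not show the formula, so take this as an assumption about the intended definition:

IG(t, c) = \sum_{c' \in \{c, \bar{c}\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c') \log_2 \frac{P(t', c')}{P(t')\, P(c')}

Equivalently, this is the reduction in the entropy of the category variable obtained by observing whether the term is present or absent.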

3 Feature selection: estimating the probabilities
- Let
  - term t occur in B documents, A of which are in category c
  - category c contain D documents, out of the N documents in the collection

4 Feature selection: estimating the probabilities
- For instance:
  - P(t) = B/N
  - P(~t) = (N-B)/N
  - P(c) = D/N
  - P(c|t) = A/B
  - P(c|~t) = (D-A)/(N-B)
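A minimal Python sketch of how these counts could be turned into an information-gain score. The entropy-based formulation below is an assumption (the slides do not spell out the exact formula), and the function name information_gain is hypothetical.

import math

def information_gain(A, B, D, N):
    """Estimate IG(t, c) from document counts.

    A: documents containing term t that belong to category c
    B: documents containing term t
    D: documents belonging to category c
    N: total number of documents in the collection
    """
    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    p_t, p_not_t = B / N, (N - B) / N
    p_c = D / N
    p_c_given_t = A / B if B else 0.0
    p_c_given_not_t = (D - A) / (N - B) if N - B else 0.0

    # H(C) - H(C|T): category entropy minus the expected entropy
    # after observing the presence/absence of the term
    h_c = entropy([p_c, 1 - p_c])
    h_c_given_t = (p_t * entropy([p_c_given_t, 1 - p_c_given_t])
                   + p_not_t * entropy([p_c_given_not_t, 1 - p_c_given_not_t]))
    return h_c - h_c_given_t

# e.g. a term occurring in 300 documents, 200 of them in a category of
# 400 documents, in a collection of 10,000 documents
print(information_gain(A=200, B=300, D=400, N=10_000))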

5 Evaluation of text classifiers
- Evaluation of document classifiers is typically conducted experimentally rather than analytically
- Reason: evaluating a system analytically would require a formal specification of the problem the system is trying to solve
- Text categorization is not formalisable in this sense

6 Evaluation
- The experimental evaluation of a classifier usually measures its effectiveness (rather than its efficiency)
  - effectiveness = ability to take the right classification decisions
  - efficiency = time and space requirements

7 Evaluation
- After a classifier is constructed using a training set, its effectiveness is evaluated using a test set
- The following counts are computed for each category i:
  - TP_i: true positives
  - FP_i: false positives
  - TN_i: true negatives
  - FN_i: false negatives

8 Evaluation
- TP_i: true positives w.r.t. category c_i
  - the set of documents that both the classifier and the previous judgments (as recorded in the test set) classify under c_i
- FP_i: false positives w.r.t. category c_i
  - the set of documents that the classifier classifies under c_i, but the test set indicates that they do not belong to c_i

9 Evaluation
- TN_i: true negatives w.r.t. c_i
  - both the classifier and the test set agree that the documents in TN_i do not belong to c_i
- FN_i: false negatives w.r.t. c_i
  - the classifier does not classify the documents in FN_i under c_i, but the test set indicates that they should be classified under c_i
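A minimal sketch of how these per-category counts might be tallied from true and predicted labels on a test set. The data layout (one set of category names per document, i.e. a multi-label setting) is an assumption made for illustration.

from collections import Counter

def confusion_counts(true_labels, predicted_labels, categories):
    """Per-category TP/FP/TN/FN counts over a test set.

    true_labels, predicted_labels: lists of sets of category names,
    one set per document.
    """
    counts = {c: Counter() for c in categories}
    for true, pred in zip(true_labels, predicted_labels):
        for c in categories:
            if c in pred and c in true:
                counts[c]["TP"] += 1
            elif c in pred and c not in true:
                counts[c]["FP"] += 1
            elif c not in pred and c in true:
                counts[c]["FN"] += 1
            else:
                counts[c]["TN"] += 1
    return counts

true = [{"sports"}, {"politics"}, {"sports", "politics"}]
pred = [{"sports"}, {"sports"}, {"politics"}]
print(confusion_counts(true, pred, ["sports", "politics"]))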

10 Evaluation measures
- Precision w.r.t. c_i: P_i = TP_i / (TP_i + FP_i)
- Recall w.r.t. c_i: R_i = TP_i / (TP_i + FN_i)

11 Evaluation measures
- To obtain estimates of precision and recall for the collection as a whole, two different methods may be adopted (a sketch of both follows below):
  - microaveraging
    - counts of true positives, false positives and false negatives for all categories are first summed up
    - precision and recall are calculated from these global values
  - macroaveraging
    - precision (recall) is averaged over the individual categories
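A minimal sketch of the two averaging schemes, reusing the per-category counts produced by the confusion_counts sketch above; the function names are illustrative.

def _safe_div(num, den):
    return num / den if den else 0.0

def micro_precision_recall(counts):
    """Sum TP/FP/FN over all categories, then compute precision and recall once."""
    tp = sum(c["TP"] for c in counts.values())
    fp = sum(c["FP"] for c in counts.values())
    fn = sum(c["FN"] for c in counts.values())
    return _safe_div(tp, tp + fp), _safe_div(tp, tp + fn)

def macro_precision_recall(counts):
    """Compute precision and recall per category, then average over categories."""
    ps = [_safe_div(c["TP"], c["TP"] + c["FP"]) for c in counts.values()]
    rs = [_safe_div(c["TP"], c["TP"] + c["FN"]) for c in counts.values()]
    return sum(ps) / len(ps), sum(rs) / len(rs)

# counts as produced by confusion_counts() in the previous sketch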

12 Evaluation measures
- Microaveraging and macroaveraging may give quite different results if the categories differ widely in generality
- e.g. the ability of a classifier to behave well also on categories with low generality (i.e. categories with few positive training instances) is emphasized by macroaveraging
- The choice depends on the application

13 Evaluation measures
- Accuracy: A = (TP + TN) / (TP + FP + TN + FN)
- Accuracy is not widely used in TC
  - the large value of the denominator makes A insensitive to variations in the number of correct decisions (TP + TN)
  - the trivial rejector tends to outperform all non-trivial classifiers

14 Evaluation measures
- Efficiency
  - seldom used, although important for practical applications
  - difficult to measure: environment parameters change
  - two parts:
    - training efficiency = average time it takes to build a classifier for a category from a training set
    - classification efficiency = average time it takes to classify a new document under a category

15 Combined effectiveness measures
- Neither precision nor recall makes sense in isolation from the other
- The trivial acceptor (every document is classified under every category) has recall = 1
  - in this case, precision would usually be very low
- Higher levels of precision may be obtained at the price of low values of recall

16 Combined effectiveness measures
- A classifier should therefore be evaluated by means of a measure that combines recall and precision

17 Reminder: Inductive construction of classifiers
- A hard classifier for a category:
  - definition of a function that returns true or false, or
  - definition of a function that returns a value between 0 and 1, followed by the definition of a threshold
    - if the value is higher than the threshold -> true
    - otherwise -> false
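A minimal sketch of the second option: a soft scoring function turned into a hard classifier by a threshold. The toy scoring function is a placeholder for illustration, not a method described on the slides.

def make_hard_classifier(score, threshold):
    """Turn a soft scorer (document -> value in [0, 1]) into a hard
    true/false decision by comparing against a threshold."""
    def classify(document):
        return score(document) > threshold
    return classify

# placeholder scorer: fraction of category terms present in the document
def toy_score(document, terms=("wheat", "grain", "export")):
    words = set(document.lower().split())
    return sum(t in words for t in terms) / len(terms)

classify_wheat = make_hard_classifier(toy_score, threshold=0.5)
print(classify_wheat("Wheat and grain exports rose sharply this quarter"))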

18 Combined effectiveness measures
- 11-point average precision
- the breakeven point
- the F1 measure

19 11-point average precision
- In constructing the classifier, the threshold is repeatedly tuned so as to make recall (for the category) take the values 0.0, 0.1, ..., 0.9, 1.0
- Precision (for the category) is computed at these 11 different values of recall and averaged over the 11 resulting values
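A minimal sketch of the idea, assuming we have a classifier score per test document and the true membership flags for one category: the threshold is swept down the ranking and, for each target recall level, the precision at the tightest threshold that reaches that recall is recorded.

def eleven_point_average_precision(scores, relevant):
    """scores: classifier scores per document; relevant: True/False flags
    (does the document really belong to the category?)."""
    ranked = sorted(zip(scores, relevant), key=lambda x: -x[0])
    total_relevant = sum(relevant)

    # (recall, precision) after admitting the top-k documents, for every k
    points = []
    tp = 0
    for k, (_, rel) in enumerate(ranked, start=1):
        tp += rel
        points.append((tp / total_relevant, tp / k))

    # for each target recall level, precision at the first cutoff
    # whose recall reaches that level
    precisions = []
    for level in [i / 10 for i in range(11)]:
        p = next((prec for rec, prec in points if rec >= level), 0.0)
        precisions.append(p)
    return sum(precisions) / 11

scores   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
relevant = [True, True, False, True, False, False, True]
print(eleven_point_average_precision(scores, relevant))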

20 Breakeven point
- A process analogous to the one used for 11-point average precision is used
- A plot of precision as a function of recall is computed by repeatedly varying the threshold
- The breakeven point is the value at which precision equals recall
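Continuing the previous sketch on the same toy data, one simple way to approximate the breakeven point is to pick the ranking cutoff where precision and recall are closest; an exact crossing would require interpolation, which is omitted here.

def breakeven_point(scores, relevant):
    """Approximate breakeven: the (precision + recall) / 2 value at the
    ranking cutoff where |precision - recall| is smallest."""
    ranked = sorted(zip(scores, relevant), key=lambda x: -x[0])
    total_relevant = sum(relevant)
    best, best_gap = 0.0, float("inf")
    tp = 0
    for k, (_, rel) in enumerate(ranked, start=1):
        tp += rel
        precision, recall = tp / k, tp / total_relevant
        gap = abs(precision - recall)
        if gap < best_gap:
            best, best_gap = (precision + recall) / 2, gap
    return best

scores   = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
relevant = [True, True, False, True, False, False, True]
print(breakeven_point(scores, relevant))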

21 F1 measure
- The F1 measure is defined as the harmonic mean of precision and recall: F1 = 2PR / (P + R)
- The breakeven of a classifier is always less than or equal to its F1 value

22 Effectiveness
- Once an effectiveness measure is chosen, a classifier can be tuned (e.g. thresholds and other parameters can be set) so that the resulting effectiveness is the best achievable by that classifier

23 Conducting experiments
- In general, different sets of experiments may be used for cross-classifier comparison only if the experiments have been performed
  - on exactly the same collection (same documents and same categories)
  - with the same split between training set and test set
  - with the same evaluation measure

