EXPERIMENTAL TECHNIQUES & EVALUATION IN NLP
Universita' di Venezia, 1 October 2003
The rise of empiricism
- Up until the 1980s, CL was primarily a theoretical discipline
- Experimental methodology now receives much more attention
Empirical methodology & evaluation
Starting with the big US ASR competitions of the 1980s, evaluation has progressively become a central component of work in NLP:
- DARPA Speech initiative
- MUC
- TREC
GOOD:
- Much easier for the community (and for the researchers themselves) to understand which proposals are real improvements
BAD:
- too much focus on small improvements
- cannot afford to try an entirely new technique (it may not lead to improvements for a couple of years!)
Typical development methodology in CL
Training set and test set
Models estimated / systems developed using a TRAINING SET
The training set should be:
- representative of the task
- as large as possible
- well-known and understood
The test set
Estimated models are evaluated using a TEST SET
The test set should be:
- disjoint from the training set
- large enough for results to be reliable
- unseen
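A minimal sketch of how such a split might be produced, assuming the corpus is already loaded as a Python list of examples (the function name and the test fraction are illustrative choices, not part of the slides):

```python
import random

def train_test_split(examples, test_fraction=0.1, seed=0):
    """Shuffle the corpus once and carve off a disjoint, held-out test set."""
    rng = random.Random(seed)
    shuffled = examples[:]           # copy so the original order is preserved
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]     # kept unseen during development
    training_set = shuffled[n_test:]
    return training_set, test_set

# Hypothetical usage: each example could be a (sentence, tag-sequence) pair.
# training_set, test_set = train_test_split(corpus, test_fraction=0.1)
```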
Possible problems with the training set
Too small → performance drops; OVERFITTING
Overfitting can be reduced using:
- cross-validation (a large variance across folds may mean the training set is too small)
- large priors
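A minimal k-fold cross-validation sketch; `train` and `evaluate` are hypothetical placeholders for whatever model is being developed. The variance it returns is the warning sign mentioned above:

```python
def cross_validate(examples, k, train, evaluate):
    """Split the data into k folds; train on k-1 folds, test on the remaining one."""
    fold_size = len(examples) // k
    scores = []
    for i in range(k):
        held_out = examples[i * fold_size:(i + 1) * fold_size]
        rest = examples[:i * fold_size] + examples[(i + 1) * fold_size:]
        model = train(rest)
        scores.append(evaluate(model, held_out))
    mean = sum(scores) / k
    variance = sum((s - mean) ** 2 for s in scores) / k
    return mean, variance   # a large variance may mean the training set is too small
```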
Possible problems with the test set
Are results obtained on the test set believable?
- results may be distorted if the test set is too easy or too hard
- training set and test set may be too different (language is not stationary)
Evaluation
Two types:
- BLACK BOX (the system as a whole)
- WHITE BOX (each component evaluated independently)
Typically QUANTITATIVE (but QUALITATIVE evaluation is needed as well)
Simplest quantitative evaluation metrics
ACCURACY: percentage correct (against some gold standard)
- e.g., a tagger gets 96.7% of tags correct when evaluated on the Penn Treebank
ERROR: percentage wrong
- ERROR REDUCTION is the most typical metric in ASR
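A minimal sketch of these metrics, assuming predicted and gold tags are given as parallel lists (the relative form of error reduction shown here is an assumption about how the figure is usually reported):

```python
def accuracy(predicted, gold):
    """Proportion of items where the prediction matches the gold standard."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

def error_rate(predicted, gold):
    """Proportion of items tagged incorrectly."""
    return 1.0 - accuracy(predicted, gold)

def error_reduction(old_error, new_error):
    """Relative error reduction between an old and a new system."""
    return (old_error - new_error) / old_error
```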
A more general form of evaluation: precision & recall
[figure]
Positives and negatives
                 selected               not selected
target           true positives (TP)    false negatives (FN)
not target       false positives (FP)   true negatives (TN)
Precision and recall
PRECISION: proportion of correct items AMONG THE SELECTED ITEMS = TP / (TP + FP)
RECALL: proportion of the correct items that are selected = TP / (TP + FN)
The tradeoff between precision and recall
Easy to get high precision: select only the few items you are most sure about
Easy to get high recall: return everything
Really need to report BOTH, or the F-measure (which combines them)
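A minimal sketch, assuming the system's output and the gold standard are given as sets of items; the balanced F-measure computed here is the harmonic mean of precision and recall:

```python
def precision_recall_f1(selected, correct):
    """selected: items returned by the system; correct: gold-standard items."""
    tp = len(selected & correct)          # true positives
    fp = len(selected - correct)          # false positives
    fn = len(correct - selected)          # false negatives
    precision = tp / (tp + fp) if selected else 0.0
    recall = tp / (tp + fn) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)   # balanced F-measure
    return precision, recall, f1
```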
Simple vs. multiple runs
A single run may be lucky:
- do multiple runs
- report averaged results
- report the degree of variation
- do SIGNIFICANCE TESTING (cf. the t-test, etc.)
A lot of people are lazy and just report single runs.
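A minimal sketch of such a comparison, assuming matched per-run (or per-fold) scores for two systems and that SciPy is available; the paired t-test is one of several tests that could be used here:

```python
from statistics import mean, stdev
from scipy.stats import ttest_rel   # paired t-test

def compare_systems(scores_a, scores_b):
    """scores_a, scores_b: matched per-run scores of two systems."""
    print(f"A: mean={mean(scores_a):.3f}  stdev={stdev(scores_a):.3f}")
    print(f"B: mean={mean(scores_b):.3f}  stdev={stdev(scores_b):.3f}")
    res = ttest_rel(scores_a, scores_b)
    # a small p-value suggests the difference is unlikely to be luck
    print(f"paired t-test: t={res.statistic:.3f}, p={res.pvalue:.4f}")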
Interpreting results
A 97% accuracy may look impressive… but not so much if 98% of the items have the same tag: need a BASELINE
An F-measure of 0.7 may not look very high unless you are told that humans only achieve 0.71 on this task: need an UPPER BOUND
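A minimal sketch of the kind of baseline hinted at above (always predict the single most frequent tag), assuming the gold tags are given as a list:

```python
from collections import Counter

def majority_baseline_accuracy(gold_tags):
    """Accuracy obtained by always predicting the most frequent tag."""
    count = Counter(gold_tags).most_common(1)[0][1]
    return count / len(gold_tags)
```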
Confusion matrices
Once you've evaluated your model, you may want to do some ERROR ANALYSIS. This is usually done with a CONFUSION MATRIX.
[example confusion matrix with rows and columns labelled JJ, NN, VB, showing how often each gold tag receives each predicted tag]
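A minimal sketch of building such a matrix from parallel lists of gold and predicted tags:

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count how often each gold tag is predicted as each (possibly wrong) tag."""
    counts = Counter(zip(gold, predicted))
    tags = sorted(set(gold) | set(predicted))
    # Print a simple table: rows = gold tag, columns = predicted tag.
    print("gold\\pred", *tags, sep="\t")
    for g in tags:
        print(g, *(counts[(g, p)] for p in tags), sep="\t")
```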
Readings
Manning and Schütze, chapter 8.1