September 2003 1 EXPERIMENTAL TECHNIQUES & EVALUATION IN NLP
Università di Venezia, 1 October 2003
September 2003 2 The rise of empiricism
Up until the 1980s, CL was primarily a theoretical discipline
Much more attention is now paid to experimental methodology
September 2003 3 Empirical methodology & evaluation
Starting with the big US ASR competitions of the 1980s, evaluation has progressively become a central component of work in NLP
– DARPA Speech initiative
– MUC
– TREC
GOOD:
– Much easier for the community (& for researchers themselves) to understand which proposals are real improvements
BAD:
– too much focus on small improvements
– cannot afford to try an entirely new technique (it may not lead to improvements for a couple of years!)
September 2003 4 Typical developmental methodology in CL
September 2003 5 Training set and test set
Models estimated / systems developed using a TRAINING SET
The training set should be
- representative of the task
- as large as possible
- well-known and understood
September 2003 6 The test set
Estimated models evaluated using a TEST SET
The test set should be
- disjoint from the training set
- large enough for results to be reliable
- unseen
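As a minimal illustration of the two slides above, here is a sketch of a held-out split; the toy corpus, split fraction, and random seed are invented for illustration and are not from the slides:

```python
import random

def train_test_split(items, test_fraction=0.1, seed=0):
    """Shuffle the labelled items and hold out a disjoint, unseen test set."""
    rng = random.Random(seed)
    shuffled = items[:]               # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test_set = shuffled[:n_test]      # kept unseen until final evaluation
    training_set = shuffled[n_test:]  # used to estimate the model
    return training_set, test_set

# Toy usage: 100 fake (item_id, label) pairs
corpus = [(i, i % 3) for i in range(100)]
train, test = train_test_split(corpus, test_fraction=0.2)
assert not set(train) & set(test)     # training set and test set are disjoint
```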
September 2003 7 Possible problems with the training set
Too small: performance drops
OVERFITTING: can be reduced using
- cross-validation (large variance may mean the training set is too small)
- large priors
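A sketch of the cross-validation idea from the slide above, run on a toy majority-tag "model"; the data and the two helper functions are invented stand-ins for a real tagger:

```python
import statistics

def cross_validate(items, k, train_fn, score_fn):
    """k-fold cross-validation: return one held-out score per fold."""
    folds = [items[i::k] for i in range(k)]        # k roughly equal folds
    scores = []
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(training)
        scores.append(score_fn(model, held_out))
    return scores

def train_majority_tagger(data):
    """Toy 'model': just the most frequent tag in the training data."""
    tags = [tag for _, tag in data]
    return max(set(tags), key=tags.count)

def tagging_accuracy(model_tag, held_out):
    """Toy scorer: accuracy of always predicting the majority tag."""
    return sum(1 for _, tag in held_out if tag == model_tag) / len(held_out)

toy_data = [(f"word{i}", "NN" if i % 4 else "VB") for i in range(200)]
scores = cross_validate(toy_data, 10, train_majority_tagger, tagging_accuracy)
# A large variance across the fold scores may mean the training set is too small.
print(statistics.mean(scores), statistics.stdev(scores))
```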
September 2003 8 Possible problems with the test set
Are results using the test set believable?
- results might be distorted if the test set is too easy / too hard
- training set and test set may be too different (language is non-stationary)
September 2003 9 Evaluation
Two types:
- BLACK BOX (system as a whole)
- WHITE BOX (components independently)
Typically QUANTITATIVE (but need QUALITATIVE as well)
September 2003 10 Simplest quantitative evaluation metrics
ACCURACY: percentage correct (against some gold standard)
- e.g., a tagger gets 96.7% of tags correct when evaluated on the Penn Treebank
ERROR: percentage wrong
- ERROR REDUCTION is the most typical metric in ASR
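These metrics are a few lines of arithmetic; a sketch follows (the tag sequences and error rates below are made up, not the 96.7% Penn Treebank figure):

```python
def accuracy(gold, predicted):
    """Percentage of items whose predicted tag matches the gold standard."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

def error_reduction(old_error, new_error):
    """Relative error reduction, the usual way ASR improvements are reported."""
    return (old_error - new_error) / old_error

gold = ["DT", "NN", "VB", "NN"]
pred = ["DT", "NN", "NN", "NN"]
acc = accuracy(gold, pred)                     # 0.75
err = 1 - acc                                  # 0.25
print(acc, err, error_reduction(0.25, 0.20))   # 0.20 = 20% relative error reduction
```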
September 2003 11 A more general form of evaluation: precision & recall
[diagram]
September 2003 12 Positives and negatives
[diagram: TRUE POSITIVES (TP), FALSE POSITIVES (FP), TRUE NEGATIVES, FALSE NEGATIVES]
September 2003 13 Precision and recall
PRECISION: proportion of correct items AMONG THE SELECTED ITEMS
RECALL: proportion of the correct items that were selected
September 2003 14 The tradeoff between precision and recall
Easy to get high precision: never classify anything (no false positives)
Easy to get high recall: return everything (no false negatives)
Really need to report BOTH, or the F-measure
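A minimal sketch of precision, recall and the balanced F-measure (F1) in terms of the TP/FP/FN counts from the earlier slide, with invented counts that illustrate the tradeoff:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Returning everything: recall is perfect but precision collapses.
print(precision_recall_f1(tp=50, fp=950, fn=0))   # (0.05, 1.0, ~0.095)
# Selecting only a few sure items: precision is high but recall is low.
print(precision_recall_f1(tp=5, fp=0, fn=45))     # (1.0, 0.1, ~0.18)
```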
September 2003 15 Single vs. multiple runs
A single run may be lucky:
- Do multiple runs
- Report averaged results
- Report the degree of variation
- Do SIGNIFICANCE TESTING (cf. t-test, etc.)
A lot of people are lazy and just report single runs.
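One way to follow this advice in practice is sketched below; the per-run accuracies are invented, and SciPy's paired t-test is just one common choice of significance test:

```python
import statistics
from scipy import stats   # assumes SciPy is installed

# Accuracy of two systems over the same 5 test splits (invented numbers).
system_a = [0.912, 0.905, 0.918, 0.910, 0.908]
system_b = [0.921, 0.917, 0.925, 0.919, 0.916]

# Report averaged results and the degree of variation.
print("mean A:", statistics.mean(system_a), "stdev:", statistics.stdev(system_a))
print("mean B:", statistics.mean(system_b), "stdev:", statistics.stdev(system_b))

# Paired t-test: is the difference between the two systems significant?
result = stats.ttest_rel(system_a, system_b)
print("t =", result.statistic, "p =", result.pvalue)
```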
September 2003 16 Interpreting results
A 97% accuracy may look impressive … but not so much if 98% of the items have the same tag (a majority-class baseline would already score 98%): need a BASELINE
An F-measure of .7 may not look very high unless you are told that humans only achieve .71 at this task: need an UPPER BOUND
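The baseline point can be made concrete with a short sketch, using the slide's hypothetical 98%-majority tag distribution:

```python
from collections import Counter

# Suppose 98% of test tokens carry the same tag, as in the slide's example.
gold_tags = ["NN"] * 98 + ["VB", "JJ"]

# Majority-class baseline: always predict the most frequent tag.
majority_tag, majority_count = Counter(gold_tags).most_common(1)[0]
baseline_accuracy = majority_count / len(gold_tags)
print(majority_tag, baseline_accuracy)   # NN 0.98, already above 97%
```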
September 2003 17 Confusion matrices
Once you've evaluated your model, you may want to do some ERROR ANALYSIS. This is usually done with a CONFUSION MATRIX.
[table: confusion matrix over the tags JJ, NN, VB]
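A minimal way to build the kind of confusion matrix shown on the slide, for error analysis over POS tags; the gold and predicted tag sequences are invented:

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count how often each gold tag was predicted as each other tag."""
    return Counter(zip(gold, predicted))

gold = ["JJ", "NN", "VB", "NN", "JJ", "VB"]
pred = ["JJ", "NN", "NN", "NN", "NN", "VB"]
matrix = confusion_matrix(gold, pred)

# Print rows = gold tags, columns = predicted tags.
tags = ["JJ", "NN", "VB"]
print("     " + "  ".join(f"{t:>4}" for t in tags))
for g in tags:
    row = "  ".join(f"{matrix[(g, p)]:>4}" for p in tags)
    print(f"{g:>4} {row}")
```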
September 2003 18 Readings
Manning and Schütze, chapter 8.1