1 I256: Applied Natural Language Processing
Marti Hearst
Sept 27, 2006
2 Evaluation Measures
3 Precision: proportion of the items you labeled X that the gold standard says really are X
– #correctly labeled by alg / all labels assigned by alg
– #True Positive / (#True Positive + #False Positive)
Recall: proportion of the items labeled X in the gold standard that you actually labeled X
– #correctly labeled by alg / all possible correct labels
– #True Positive / (#True Positive + #False Negative)
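To make the formulas concrete, here is a minimal sketch (the toy labels and counts are illustrative, not from the lecture) that computes precision and recall from a predicted labeling and a gold-standard labeling:

```python
# Toy binary labeling: "X" = positive class, "O" = everything else (illustrative only).
gold      = ["X", "O", "X", "X", "O", "O"]
predicted = ["X", "X", "O", "X", "O", "O"]

tp = sum(1 for g, p in zip(gold, predicted) if p == "X" and g == "X")  # true positives
fp = sum(1 for g, p in zip(gold, predicted) if p == "X" and g != "X")  # false positives
fn = sum(1 for g, p in zip(gold, predicted) if p != "X" and g == "X")  # false negatives

precision = tp / (tp + fp)  # of the items the algorithm labeled X, how many really are X
recall    = tp / (tp + fn)  # of the items that really are X, how many the algorithm found
print(precision, recall)    # 0.666..., 0.666...
```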
4 F-measure
You can "cheat" on precision by labeling (almost) nothing as X.
You can "cheat" on recall by labeling everything as X.
The better you do on precision, the worse you do on recall, and vice versa.
The F-measure is a balance between the two:
– 2 * precision * recall / (precision + recall)
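A sketch of the balanced F-measure using the same formula; the zero-division guard is an addition, not something on the slide:

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.8, 0.6))   # 0.6857...
print(f_measure(1.0, 0.01))  # ~0.0198: near-perfect precision can't hide terrible recall
```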
5 Evaluation Measures
Accuracy: proportion of the items that you got right
– (#True Positive + #True Negative) / N, where N = TP + TN + FP + FN
Error: proportion of the items that you got wrong
– (#False Positive + #False Negative) / N
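Continuing the toy counts from the precision/recall sketch above (tp = 2, fp = 1, fn = 1, plus 2 true negatives), a minimal accuracy/error computation:

```python
tp, tn, fp, fn = 2, 2, 1, 1   # illustrative confusion counts
n = tp + tn + fp + fn

accuracy = (tp + tn) / n      # proportion of all decisions that were right
error    = (fp + fn) / n      # proportion that were wrong; accuracy + error == 1.0
print(accuracy, error)        # 0.666..., 0.333...
```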
6 Prec/Recall vs. Accuracy/Error
When to use Precision/Recall?
Useful when there are only a few positives and very many negatives
Also good for ranked output
– Search results ranking
When to use Accuracy/Error?
When every item has to be judged, and it's important that every item be correct
Error is better when the differences between algorithms are very small; it lets you focus on small improvements
– Speech recognition
7 Evaluating Partial Parsing
How do we evaluate it?
8 Evaluating Partial Parsing
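The evaluation slides themselves are image-only in this transcript. As a hedged sketch of the same kind of evaluation with today's NLTK chunking API (the NP rule below is a placeholder, not the rule from class, and method names can differ across NLTK versions):

```python
import nltk
from nltk.corpus import conll2000   # requires nltk.download('conll2000') once

# A deliberately simple noun-phrase rule, standing in for the lecture's rule.
chunker = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")

test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])
score = chunker.evaluate(test_sents)   # ChunkScore: IOB accuracy, precision, recall, F-measure
print(score)
```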
9 Testing our Simple Rule
Let's see where we missed:
10 Update rules; Evaluate Again
11 Evaluate on More Examples
12 Incorrect vs. Missed
Add code to print out which were incorrect.
13 Missed vs. Incorrect
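The "print out which were incorrect" code is not preserved here; one way to do it, assuming the placeholder rule from the evaluation sketch above, is with NLTK's ChunkScore, which keeps the missed and incorrect chunks around:

```python
import nltk
from nltk.corpus import conll2000

chunker = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")   # same placeholder rule as above
test_sents = conll2000.chunked_sents("test.txt", chunk_types=["NP"])
score = chunker.evaluate(test_sents)

# Gold-standard chunks the rule never produced (these hurt recall) ...
for chunk in score.missed()[:10]:
    print("MISSED:   ", chunk)

# ... and chunks the rule produced that are not in the gold standard (these hurt precision).
for chunk in score.incorrect()[:10]:
    print("INCORRECT:", chunk)
```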
14 What is a good Chunking Baseline?
16 The Tree Data Structure
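The Tree slide is image-only; a minimal sketch of NLTK's Tree data structure, which is what the chunkers above return (the example sentence is made up):

```python
from nltk.tree import Tree

# A chunked sentence: the root is S, NP chunks are subtrees, and unchunked
# tokens sit directly under the root as (word, tag) pairs.
sent = Tree("S", [
    Tree("NP", [("the", "DT"), ("little", "JJ"), ("cat", "NN")]),
    ("sat", "VBD"),
    ("on", "IN"),
    Tree("NP", [("the", "DT"), ("mat", "NN")]),
])

print(sent.label())                                   # 'S'
for np in sent.subtrees(lambda t: t.label() == "NP"):
    print(np.leaves())                                # the (word, tag) pairs inside each NP
```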
17 Baseline Code (continued)
18 Evaluating the Baseline
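The baseline slides are image-only, so the class's actual baseline code isn't preserved in this transcript. A standard answer to "what is a good chunking baseline?", essentially the unigram chunker from the NLTK book, assigns each part-of-speech tag its most frequent IOB chunk tag in the training data:

```python
import nltk
from nltk.corpus import conll2000

class UnigramChunker(nltk.ChunkParserI):
    """Baseline chunker: map each POS tag to its most frequent IOB chunk tag."""

    def __init__(self, train_sents):
        # Train a unigram tagger over (POS tag, IOB tag) pairs.
        train_data = [[(pos, iob) for word, pos, iob in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        # sentence is a list of (word, POS) pairs; predict an IOB tag per token.
        pos_tags = [pos for word, pos in sentence]
        iob_tags = [iob for pos, iob in self.tagger.tag(pos_tags)]
        conlltags = [(word, pos, iob) for (word, pos), iob in zip(sentence, iob_tags)]
        return nltk.chunk.conlltags2tree(conlltags)

train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
test_sents  = conll2000.chunked_sents("test.txt",  chunk_types=["NP"])
print(UnigramChunker(train_sents).evaluate(test_sents))
```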
19 Cascaded Chunking
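The cascaded-chunking slides are image-only as well; a sketch of cascaded chunking with NLTK's RegexpParser, close to the multi-stage grammar in the NLTK book (the grammar and example sentence are illustrative):

```python
import nltk

# Each stage can refer to chunks built by earlier stages; loop=2 runs the whole
# cascade twice so that, e.g., a CLAUSE found on the first pass can feed a VP.
grammar = r"""
  NP:     {<DT|JJ|NN.*>+}           # chunk determiner/adjective/noun runs
  PP:     {<IN><NP>}                # preposition followed by an NP
  VP:     {<VB.*><NP|PP|CLAUSE>+$}  # verb followed by its complements
  CLAUSE: {<NP><VP>}                # NP followed by a VP
"""
cascade = nltk.RegexpParser(grammar, loop=2)

sent = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
        ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cascade.parse(sent))
```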
21 Next Time: Summarization