1
Evaluation of NLP Systems
Martin Hassel
KTH CSC, Royal Institute of Technology
100 44 Stockholm
+46-8-790 66 34
xmartin@nada.kth.se
2
Why Evaluate?
  Otherwise you won’t know if what you’re doing is any good!
  Human languages are very loosely defined
  This makes it hard to prove that something works (as you do in mathematics or logic)
3
Aspects of Evaluation
  General aspects
    To measure progress
  Commercial aspects
    To ensure consumer satisfaction
    Edge against competitors / PR
  Scientific aspects
    Good science
4
Manual Evaluation
  Human judges
    + Semantically based assessment
    – Subjective
    – Time consuming
    – Expensive
5
Semi-Automatic Evaluation
  Task based evaluation
    + Measures the system’s utility
    – Subjective interpretation of questions and answers
6
Automatic Evaluation
  Example from Text Summarization
  Sentence Recall
    + Cheap and repeatable
    – Does not distinguish between different summaries
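As a rough illustration of sentence recall for extractive summaries, the sketch below counts how many gold-standard sentences also appear in the system summary. The function name `sentence_recall` and the placeholder sentences are assumptions made for this example, and sentences are matched by exact string equality.

```python
def sentence_recall(system_sentences, gold_sentences):
    """Fraction of gold-standard sentences that the system also selected."""
    gold = set(gold_sentences)
    if not gold:
        return 0.0
    return len(gold & set(system_sentences)) / len(gold)


# Placeholder sentences; a real evaluation would use extracted text.
gold = ["A.", "B.", "C.", "D."]
system = ["A.", "C.", "E."]
recall = sentence_recall(system, gold)  # 2 of the 4 gold sentences are found
```

Because the computation is a plain set intersection, it can be rerun after every parameter change at essentially no cost.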
7
Why Automatic Evaluation?
  Manual labor is expensive and takes time
  It’s practical to be able to evaluate often – does this parameter lead to improvements?
  It’s tedious to evaluate manually
  Human factor – People tend to tire and make mistakes
8
Corpora
  A body of data considered to represent “reality” in a balanced way
  Sampling
  Raw format vs. annotated data
9
Corpora can be…
  a Part-of-Speech tagged data collection

  Arrangör        nn.utr.sin.ind.nom
  var             vb.prt.akt.kop
  Järfälla        pm.gen
  naturförening   nn.utr.sin.ind.nom
  där             ha
  Margareta       pm.nom
  är              vb.prs.akt.kop
  medlem          nn.utr.sin.ind.nom
  .               mad
10
Corpora can be…
  a parse tree data collection

  (S
    (NP-SBJ (NNP W.R.) (NNP Grace))
    (VP (VBZ holds)
      (NP
        (NP (CD three))
        (PP (IN of)
          (NP
            (NP (NNP Grace) (NNP Energy) (POS 's))
            (CD seven) (NN board) (NNS seats)))))
    (. .))
11
Corpora can be…
  a collection of sound samples
12
Widely Accepted Corpora
  Pros
    Well-defined origin and context
    Well-established evaluation schemes
    Inter-system comparability
  Cons
    Optimizing for a specific data set
    May establish a common “truth”
13
Gold Standard
  “Correct guesses” demand knowing what the result should be
  This “optimal” result is often called a gold standard
  How the gold standard looks and how you count can differ a lot between tasks
  The basic idea is however the same
14
Example of a Gold Standard
  Gold standard for tagging, shallow parsing and clause boundary detection

  Han      pn.utr.sin.def.sub                   NPB              CLB
  är       vb.prs.akt.kop                       VCB              CLI
  mest     ab.suv                               ADVPB|APMINB     CLI
  road     jj.pos.utr.sin.ind.nom               APMINB|APMINI    CLI
  av       pp                                   PPB              CLI
  äldre    jj.kom.utr/neu.sin/plu.ind/def.nom   APMINB|NPB|PPI   CLI
  sorter   nn.utr.plu.ind.nom                   NPI|PPI          CLI
  .        mad                                  0                CLI
15
Some Common Measures
  Precision = correct guesses / all guesses
  Recall = correct guesses / correct answers
  Precision and recall are often mutually dependent
    higher recall → lower precision
    higher precision → lower recall
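The two definitions can be written directly as set operations. The following is a minimal sketch: the function name `precision_recall` and the example token positions are invented for illustration, and each guess is assumed to be either fully correct or fully wrong.

```python
def precision_recall(guesses, gold):
    """Return (precision, recall) for a set of guesses against a gold set."""
    guesses, gold = set(guesses), set(gold)
    correct = len(guesses & gold)
    precision = correct / len(guesses) if guesses else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: the system flags positions 1-4,
# while the gold standard marks positions 2, 4 and 5.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5})
# p is 2/4 (two of four guesses are correct), r is 2/3 (two of three answers found)
```

Guessing more positions can only keep recall the same or raise it, but every extra wrong guess lowers precision, which is the dependency noted on the slide.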
16
More Evaluation Terminology
  True positive – Alarm given at correct point
  False negative – No alarm when one should be given
  False positive – Alarm given when none should be given
  (True negative) – The algorithm is quiet on uninteresting data
  In e.g. spell checking the above could correspond to detected errors, missed errors, false alarms and correct words without warning.
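For a spell checker, the four outcomes can be tallied from two parallel lists with one entry per word. This is only a sketch; `count_outcomes` and the toy data are assumptions, not part of any real spell checking tool.

```python
from collections import Counter

# Maps (alarm given?, word really misspelled?) to the outcome name.
OUTCOMES = {
    (True, True): "true positive",     # detected error
    (False, True): "false negative",   # missed error
    (True, False): "false positive",   # false alarm
    (False, False): "true negative",   # correct word without warning
}


def count_outcomes(alarms, is_error):
    """Tally the four outcomes over parallel lists of booleans."""
    return Counter(OUTCOMES[(a, e)] for a, e in zip(alarms, is_error))


# Toy data for five words.
alarms = [True, False, True, False, False]
is_error = [True, True, False, False, False]
counts = count_outcomes(alarms, is_error)

tp = counts["true positive"]
precision = tp / (tp + counts["false positive"])
recall = tp / (tp + counts["false negative"])
```

True negatives do not enter either formula, which is why that outcome is often left out when reporting precision and recall.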
17
How Good Is 95%?
  It depends on what problem you are solving!
  Try to determine expected upper and lower bounds for performance (of a specific task)
  A baseline tells you the performance of a naïve approach (lower bound)
18
Lower Bound
  Baselines
    Serve as lower limit of acceptability
    Common to have several baselines
  Common baselines
    Random
    Most common choice/answer (e.g. in tagging)
    Linear selection (e.g. in summarization)
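A "most common answer" baseline can be sketched in a few lines. The version below is deliberately simplified: it guesses the single most frequent tag in the training data for every token, whereas a tagging baseline is more often computed per word form. The function name and the tag lists are made up for the example.

```python
from collections import Counter


def most_common_baseline(train_tags, gold_tags):
    """Accuracy obtained by always guessing the most frequent training tag."""
    guess = Counter(train_tags).most_common(1)[0][0]
    return sum(tag == guess for tag in gold_tags) / len(gold_tags)


# Toy data: "nn" is the most frequent tag in the training material.
train_tags = ["nn", "vb", "nn", "pp", "nn", "jj"]
gold_tags = ["nn", "vb", "nn", "pp"]
baseline = most_common_baseline(train_tags, gold_tags)
```

Any system that scores below this number is doing worse than a guesser that ignores its input.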
19
Upper Bound
  Sometimes there is an upper bound lower than 100%
  Example: In 10% of all cases experts disagree on the correct answer
  Human ceiling (inter-assessor agreement)
  Low inter-assessor agreement can sometimes be countered with comparison against several “sources”
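One simple way to estimate such a ceiling is the raw proportion of items on which two assessors give the same answer. The sketch below assumes exactly two assessors who labelled the same items in the same order. It does not correct for agreement expected by chance, and the labels are invented.

```python
def observed_agreement(labels_a, labels_b):
    """Proportion of items on which two assessors chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both assessors must label the same items")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


# Toy annotations: the assessors disagree on the last of ten items.
assessor_a = ["nn", "vb", "jj", "nn", "pp", "nn", "vb", "nn", "jj", "pp"]
assessor_b = ["nn", "vb", "jj", "nn", "pp", "nn", "vb", "nn", "jj", "ab"]
ceiling = observed_agreement(assessor_a, assessor_b)
```

With one disagreement in ten items, a system scoring above 90% against either assessor would be hard to interpret.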
20
Limited Data
  Limited data is often a problem, especially in machine learning
  We want lots of data for training
    Better results
  We want lots of data for evaluation
    More reliable numbers
  If possible, create your own data!
    Missplel
21
Limited Data
  N-fold Cross Validation
  Idea:
    1 Set 5% of the data aside for evaluation and train on 95%
    2 Set another 5% aside for evaluation and repeat training on 95%
    3 … and again (repeat in total 20 times)
  Take the mean of the evaluation results to be the final result
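The procedure above can be sketched as a short loop. Everything named here is an assumption for illustration: `cross_validate` takes any callable that trains on one list and returns a score on another, and `majority_accuracy` is a trivial stand-in for a real training routine. Folds are taken by striding through the data, so real data would normally be shuffled first.

```python
from collections import Counter


def cross_validate(data, n_folds, train_and_evaluate):
    """Mean score over n_folds runs; each fold is held out exactly once."""
    scores = []
    for i in range(n_folds):
        held_out = data[i::n_folds]  # every n_folds-th item, starting at i
        training = [x for j, x in enumerate(data) if j % n_folds != i]
        scores.append(train_and_evaluate(training, held_out))
    return sum(scores) / n_folds


def majority_accuracy(training, held_out):
    """Stand-in 'model': always guess the most common training label."""
    guess = Counter(training).most_common(1)[0][0]
    return sum(x == guess for x in held_out) / len(held_out)


# 100 toy labels and 20 folds: 5% held out, 95% used for training each time.
data = ["nn"] * 60 + ["vb"] * 40
mean_score = cross_validate(data, 20, majority_accuracy)
```

Every item is used for evaluation once and for training nineteen times, so no data is wasted on a fixed test set.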
22
Concrete Examples
  Tagging
    Force the tagger to assign exactly one tag to each token – precision?
  Parsing
    What happens when almost correct?
    Partial trees, how many sentences got full trees?
  Spell checking
    Recall / precision for alarms
    How far down in the suggestion list is the correct suggestion?
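The last question, how far down the suggestion list the correct word appears, can be measured as a 1-based rank. This is a hypothetical helper, not the interface of any particular spell checker.

```python
def suggestion_rank(suggestions, correct):
    """1-based position of the correct word, or None if it is not suggested."""
    if correct in suggestions:
        return suggestions.index(correct) + 1
    return None


# Toy suggestion list for one misspelled word.
rank = suggestion_rank(["their", "there", "three"], "there")
missing = suggestion_rank(["their", "three"], "there")
```

Averaging the rank over all detected errors, and counting the cases where the result is `None`, gives two simple summary numbers.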
23
Concrete Examples
  Grammar checking
    How many are false alarms (precision)?
    How many errors are detected (recall)?
    How many of these have the correct diagnosis?
  Machine translation & Text Summarization
    How many n-grams overlap with gold standard(s)?
    BLEU scores & ROUGE scores
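The n-gram overlap idea can be illustrated with a small sketch. It is not a full BLEU or ROUGE implementation: it handles a single reference and a single n-gram order, with no brevity penalty or averaging. BLEU is built around the precision side of this count and ROUGE-N around the recall side. The function names and sentences are assumptions for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate, reference, n):
    """Return (precision, recall) of candidate n-grams against one reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection keeps the smaller count of each shared n-gram,
    # so a repeated n-gram is not credited more often than it occurs.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
p, r = ngram_overlap(candidate, reference, 2)  # 3 of 5 bigrams are shared
```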
24
Concrete Examples
  Information retrieval
    What is the precision of the first X hits? At Y% recall?
    Mean precision.
  Text categorization
    How many documents were correctly classified?
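Precision over the first X hits can be computed as below. This is a sketch with invented document identifiers; it divides by the cutoff `k` even when fewer than `k` hits are returned, which is a common convention but an assumption here.

```python
def precision_at_k(ranked_hits, relevant, k):
    """Fraction of the first k hits that are relevant."""
    return sum(doc in relevant for doc in ranked_hits[:k]) / k


# Toy ranking: d1 and d2 are the only relevant documents that were retrieved.
ranked_hits = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
p_at_3 = precision_at_k(ranked_hits, relevant, 3)  # one relevant hit in the top 3
p_at_5 = precision_at_k(ranked_hits, relevant, 5)  # two relevant hits in the top 5
```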