
1 Resources:
 Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al.
 Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff
 The CoNLL-2013 Shared Task on Grammatical Error Correction, Ng et al.
 Better Evaluation for Grammatical Error Correction, Dahlmeier and Ng

2 [Figure: in standard NLP evaluation, the annotator tag is compared against the system output; in error detection evaluation, the comparison additionally involves the learner sentence.] Resource: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al.

3  Comma restoration task  Commas are removed from well-edited text (the gold standard)  The system tries to restore the commas by predicting their locations  Comparison: ▪ Binary distinction (presence or absence of a comma)

4

5  Comma error detection task  The system seeks to find and correct errors in the writer's usage of commas  Intricacies: ▪ Positive class: an error by the writer that involves a comma (not the mere presence of a comma)  A mismatch between the writer's sentence and the annotator's judgement ▪ Negative class: the writer and the annotator agree ▪ The system's judgement has not been considered yet ▪ Writer-Annotator-System (WAS)

6 Contingency scheme for WAS: considering the system's prediction and the writer's form together

7 Contingency scheme for WAS: considering the system's prediction and the gold standard together

8

9 Simplified contingency scheme

10

11

12

13 Expected proportion of TP matches
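The formula itself did not survive the transcript. As a hedged reconstruction from the chance-agreement setup in Chodorow et al.: with prevalence p (the proportion of writer errors) and bias b (the proportion of cases the system flags), independent, i.e. chance-level, prediction gives

\[
\frac{E[\mathrm{TP}]}{n} = p\,b, \qquad p_e = p\,b + (1-p)(1-b)
\]

where p_e is the expected chance agreement later used in kappa.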

14 System 1: predictions are correct at chance level

15 System 2: prevalence and bias remain the same

16 System 3: increase bias and prevalence + predictions are correct at chance level

17

18  Dealing with sensitivity to bias  Vary the threshold and generate a precision-recall curve (a sketch follows)
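An illustrative sketch (not from the slides) of how the threshold sweep produces the curve, using scikit-learn on hypothetical labels and scores:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical gold labels (1 = writer error) and system confidence scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6, 0.3, 0.5])

# Each distinct score acts as a threshold; each threshold yields one (P, R) point.
precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r in zip(precision, recall):
    print(f"precision={p:.2f}  recall={r:.2f}")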

19  Dealing with sensitivity to bias  Area under the Receiver Operating Characteristic (ROC) curve, i.e. AUC ▪ The diagonal of the ROC plot is the curve for random prediction ▪ The effect of random prediction is not nullified: the area under random prediction is still 0.5

20 Class skewness is already taken care of [Figure: ROC curve; x-axis: False Positive Rate, y-axis: True Positive Rate]
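A companion sketch for the ROC view: AUC stays near 0.5 for random scores regardless of class skew, which is the sense in which skewness is taken care of (hypothetical data again):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Heavily skewed labels: ~5% positives, as in error detection.
y_true = rng.random(10000) < 0.05
random_scores = rng.random(10000)                      # a random predictor
informed_scores = y_true + rng.normal(0, 0.5, 10000)   # a weakly informed one

print(roc_auc_score(y_true, random_scores))    # ~0.5 whatever the skew
print(roc_auc_score(y_true, informed_scores))  # well above 0.5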

21  The positive class consists of an error in the writer's text  There is no 1:1:1 correspondence between the writer's sentence, the annotator's correction and the type of error ▪ Writer: Book of my class inspired me ▪ A book in my class inspired me (article error) ▪ Books for my class inspired me (number error) ▪ The books of my class were inspiring to me (article + number error)

22  Assuming no ambiguity in error type  What would be the size of the unit over which an error is defined? ▪ Writer: The book in my class inspire me ▪ a) The book in my class inspires me ▪ b) The books in my class inspire me  Unit size: morpheme level? word level? phrase level? string level?  Token-based approach vs. string-based approach

23 EDM can handle both

24  EDMs are good for comparison, not for providing feedback to the writer  If book and inspire are not linked, feedback like "violation of subject-verb agreement" cannot be provided

25

26

27 Inject 100 more TNs: Accuracy rises from 0.54 to 0.77, while Kappa rises only from 0.00 to 0.21
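The contingency counts behind these numbers are not shown on the slide; the counts below (TP=12, FP=18, FN=28, TN=42) are reverse-engineered so that the sketch reproduces the slide's figures. The point: added TNs inflate accuracy sharply, while kappa, which discounts chance agreement, stays low.

def accuracy_and_kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    po = (tp + tn) / n                       # observed agreement (accuracy)
    sys_pos = (tp + fp) / n                  # bias: system's positive rate
    wri_pos = (tp + fn) / n                  # prevalence: writer's error rate
    pe = sys_pos * wri_pos + (1 - sys_pos) * (1 - wri_pos)  # chance agreement
    return po, (po - pe) / (1 - pe)

print(accuracy_and_kappa(12, 18, 28, 42))        # (0.54, 0.00)
print(accuracy_and_kappa(12, 18, 28, 42 + 100))  # (0.77, ~0.21)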

28

29

30  Extraction of the system edit from the writer's text (source) and the system output (hypothesis)  Done with the GNU wdiff utility ▪ Source: Our baseline system feeds word into PB-SMT pipeline ▪ Hypothesis: Our baseline system feeds a word into PB-SMT pipeline  The hypothesis matches the first gold-standard edit, but the extracted edit is flagged as invalid
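The slides name GNU wdiff for this step; purely as an illustrative stand-in, Python's difflib performs the same token-level extraction:

import difflib

source = "Our baseline system feeds word into PB-SMT pipeline".split()
hypothesis = "Our baseline system feeds a word into PB-SMT pipeline".split()

# Extract token-level edits as (source span, replacement) pairs.
matcher = difflib.SequenceMatcher(a=source, b=hypothesis)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, source[i1:i2], "->", hypothesis[j1:j2])
# insert [] -> ['a']   (one of several equivalent ways to express this edit)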

31  Key idea  There may be multiple ways to arrive at the same correction  Extract the set of edits that matches the gold standard maximally

32

33  Notation  An edit is a triple (a, b, C) ▪ Start and end token offsets a and b with respect to the source sentence ▪ A correction C ▪ For a gold-standard edit, C is a set of corrections ▪ For a system edit, C is a single correction
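A minimal sketch of this representation in Python (the names are illustrative, not from the papers):

from dataclasses import dataclass

@dataclass(frozen=True)
class Edit:
    a: int                  # start token offset into the source sentence
    b: int                  # end token offset into the source sentence
    corrections: frozenset  # gold edit: a set of strings; a system edit holds one

# Gold-standard edit replacing source tokens 4..5 ("word") by "a word":
gold = Edit(4, 5, frozenset({"a word"}))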

34

35  Edit metric: Levenshtein distance  The minimum number of insertions, deletions and substitutions needed to transform one string into another  How to compute the Levenshtein distance? ▪ Use a 2-D matrix (the Levenshtein matrix) to store the edit costs of substring pairs ▪ Compute the individual cell entries (edit costs) with dynamic programming ▪ The corner cell stores the optimal edit cost

36  Slides from Jurafsky's course page

37  Spell correction ▪ The user typed "graffe". Which is closest? graf, graft, grail, giraffe  Computational biology ▪ Align two sequences of nucleotides:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
▪ Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
 Also used for machine translation, information extraction and speech recognition

38  The minimum edit distance between two strings  Is the minimum number of editing operations  Insertion  Deletion  Substitution  Needed to transform one into the other

39  Two strings and their alignment: [figure: the alignment of "intention" and "execution"]

40  If each operation has a cost of 1 ▪ The distance between these is 5 (1 deletion, 1 insertion and 3 substitutions)  If substitutions cost 2 (Levenshtein) ▪ The distance between them is 8

41  Searching for a path (sequence of edits) from the start string to the final string:  Initial state: the word we’re transforming  Operators: insert, delete, substitute  Goal state: the word we’re trying to get to  Path cost: what we want to minimize: the number of edits

42  But the space of all edit sequences is huge!  We can’t afford to navigate naïvely  Lots of distinct paths wind up at the same state. ▪ We don’t have to keep track of all of them ▪ Just the shortest path to each of those revisited states.

43  For two strings  X of length n  Y of length m  We define D(i,j)  the edit distance between X[1..i] and Y[1..j] ▪ i.e., the first i characters of X and the first j characters of Y  The edit distance between X and Y is thus D(n,m)

44  Dynamic programming:  Solving problems by combining solutions to subproblems.  A tabular computation of D(n,m)  Bottom-up  We compute D(i,j) for small i,j  And compute larger D(i,j) based on previously computed smaller values  i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m)

45  Initialization:
     D(i,0) = i
     D(0,j) = j
  Recurrence relation:
     For each i = 1…N
       For each j = 1…M
         D(i,j) = min( D(i-1,j) + 1,                                        (deletion)
                       D(i,j-1) + 1,                                        (insertion)
                       D(i-1,j-1) + 2 if X(i) ≠ Y(j), + 0 if X(i) = Y(j) )  (substitution)
  Termination:
     D(N,M) is the distance
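A direct transcription of this recurrence into Python (substitution cost 2, as on the slide); it reproduces the distance of 8 for intention → execution:

def min_edit_distance(x, y):
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # initialization: deletions only
    for j in range(m + 1):
        D[0][j] = j                      # initialization: insertions only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2   # Levenshtein: subst. costs 2
            D[i][j] = min(D[i - 1][j] + 1,           # deletion
                          D[i][j - 1] + 1,           # insertion
                          D[i - 1][j - 1] + sub)     # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8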

46 The edit distance table (initialized)
 N  9
 O  8
 I  7
 T  6
 N  5
 E  4
 T  3
 N  2
 I  1
 #  0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N

47 The Edit Distance Table
 N  9
 O  8
 I  7
 T  6
 N  5
 E  4
 T  3
 N  2
 I  1  2
 #  0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N

 D(1,1) = min( D(0,1) + 1, D(1,0) + 1, D(0,0) + 2 if X(1) ≠ Y(1), + 0 if X(1) = Y(1) ) = min(2, 2, 2) = 2

48 The Edit Distance Table
 N  9  8  9 10 11 12 11 10  9  8
 O  8  7  8  9 10 11 10  9  8  9
 I  7  6  7  8  9 10  9  8  9 10
 T  6  5  6  7  8  9  8  9 10 11
 N  5  4  5  6  7  8  9 10 11 10
 E  4  3  4  5  6  7  8  9 10  9
 T  3  4  5  6  7  8  7  8  9  8
 N  2  3  4  5  6  7  8  7  8  7
 I  1  2  3  4  5  6  7  6  7  8
 #  0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N

49  Edit distance isn't sufficient  We often need to align each character of the two strings to each other  We do this by keeping a "backtrace"  Every time we enter a cell, remember where we came from  When we reach the end, trace back the path from the upper-right corner to read off the alignment

50 The Edit Distance Table
 N  9  8  9 10 11 12 11 10  9  8
 O  8  7  8  9 10 11 10  9  8  9
 I  7  6  7  8  9 10  9  8  9 10
 T  6  5  6  7  8  9  8  9 10 11
 N  5  4  5  6  7  8  9 10 11 10
 E  4  3  4  5  6  7  8  9 10  9
 T  3  4  5  6  7  8  7  8  9  8
 N  2  3  4  5  6  7  8  7  8  7
 I  1  2  3  4  5  6  7  6  7  8
 #  0  1  2  3  4  5  6  7  8  9
    #  E  X  E  C  U  T  I  O  N

51  Base conditions: D(i,0) = i, D(0,j) = j  Termination: D(N,M) is the distance
  Recurrence relation: For each i = 1…N, for each j = 1…M:
     D(i,j) = min( D(i-1,j) + 1,                                        (deletion)
                   D(i,j-1) + 1,                                        (insertion)
                   D(i-1,j-1) + 2 if X(i) ≠ Y(j), + 0 if X(i) = Y(j) )  (substitution)
     ptr(i,j) = DOWN (deletion) | LEFT (insertion) | DIAG (substitution)
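Extending the earlier sketch with these pointers makes the backtrace concrete; this is a sketch of the idea, not code from the course:

def align(x, y):
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "DOWN"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "LEFT"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2
            # Remember which neighbour the minimum came from.
            D[i][j], ptr[i][j] = min(
                (D[i - 1][j] + 1, "DOWN"),         # deletion
                (D[i][j - 1] + 1, "LEFT"),         # insertion
                (D[i - 1][j - 1] + sub, "DIAG"))   # substitution / match
    # Trace back from the final cell to read off the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            pairs.append((x[i - 1], y[j - 1])); i, j = i - 1, j - 1
        elif move == "DOWN":
            pairs.append((x[i - 1], "*")); i -= 1
        else:
            pairs.append(("*", y[j - 1])); j -= 1
    return list(reversed(pairs))

print(align("intention", "execution"))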

52

53

54 Source: Our baseline system feeds word into PB-SMT pipeline
Hypothesis: Our baseline system feeds a word into PB-SMT pipeline

55
               #  our  baseline  system  feeds  a  word  into  PB-SMT  pipeline  .
               0   1      2        3       4    5    6     7      8       9      10
 0  #
 1  Our
 2  baseline
 3  system
 4  feeds
 5  word
 6  into
 7  PB-SMT
 8  pipeline
 9  .

56
               #  our  baseline  system  feeds  a  word  into  PB-SMT  pipeline  .
               0   1      2        3       4    5    6     7      8       9      10
 0  #          0   1      2        3       4    5    6     7      8       9      10
 1  Our        1   0      1        2       3    4    5     6      7       8       9
 2  baseline   2   1      0        1       2    3    4     5      6       7       8
 3  system     3   2      1        0       1    2    3     4      5       6       7
 4  feeds      4   3      2        1       0    1    2     3      4       5       6
 5  word       5   4      3        2       1    1    1     2      3       4       5
 6  into       6   5      4        3       2    2    2     1      2       3       4
 7  PB-SMT     7   6      5        4       3    3    3     2      1       2       3
 8  pipeline   8   7      6        5       4    4    4     3      2       1       2
 9  .          9   8      7        6       5    5    5     4      3       2       1

57 The same matrix as on slide 56, shown again for the backtrace step

58

59 Lattice built from the backtrace; vertices are (source offset, hypothesis offset) pairs, edges are labelled token(cost):
 (0,0) → (1,1): Our (1)
 (1,1) → (2,2): baseline (1)
 (2,2) → (3,3): system (1)
 (3,3) → (4,4): feeds (1)
 (4,4) → (4,5): a (1) [insertion]
 (4,5) → (5,6): word (1)
 (5,6) → (6,7): into (1)
 (6,7) → (7,8): PB-SMT (1)
 (7,8) → (8,9): pipeline (1)
 (8,9) → (9,10): . (1)

60

61

62 The lattice from slide 59, with transitive edges added (source phrase / corrected phrase (cost)):
 feeds / feeds a (2)
 word / a word (2)
 system feeds / system feeds a (3)
 word into / a word into (3)
 feeds word / feeds a word (3)

63 The transitive edge that matches the gold-standard edit is re-weighted with a large negative cost: word / a word (-45)

64  Perform a single-source shortest-path computation, with negative weights, from the start vertex to the end vertex  Bellman-Ford algorithm
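A compact sketch of the idea; the vertices and the -45 edge follow the lattice fragment above, but the code itself is illustrative, not the authors' implementation:

def bellman_ford(vertices, edges, source):
    # Single-source shortest paths; tolerates negative edge weights.
    dist = {v: float("inf") for v in vertices}
    pred = {v: None for v in vertices}
    dist[source] = 0
    for _ in range(len(vertices) - 1):     # relax every edge |V|-1 times
        for u, v, w, label in edges:
            if dist[u] + w < dist[v]:
                dist[v], pred[v] = dist[u] + w, (u, label)
    return dist, pred

# Fragment of the lattice around the insertion; the gold-matching transitive
# edge "word / a word" carries the -45 weight, so the shortest path must use it.
V = [(3, 3), (4, 4), (4, 5), (5, 6), (6, 7)]
E = [((3, 3), (4, 4), 1, "feeds"),
     ((4, 4), (4, 5), 1, "a"),
     ((4, 5), (5, 6), 1, "word"),
     ((4, 4), (5, 6), -45, "word/a word"),  # gold-matching edge, down-weighted
     ((5, 6), (6, 7), 1, "into")]
dist, pred = bellman_ford(V, E, (3, 3))
print(dist[(6, 7)], pred[(5, 6)])  # -43 ((4, 4), 'word/a word')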

65  Theorem  The set of edits corresponding to the shortest path has the maximum overlap with the gold-standard annotation

66

