1
Corpus-based generation of suggestions for correcting student errors
Paper presented at AsiaLex, August 2009
Richard Watson Todd, KMUTT
©2009 Richard Watson Todd
2
Self-correction of writing
• Language learning or language use
• Resources for writers:
  – Dictionaries (e.g. COBUILD): common syntactic patterns, but needs awareness
  – Lists of common errors: limited number of errors covered
  – Grammar checkers: potentially useful if designed for non-native writers
3
Principles of grammar checker design
• Pattern matching
  – e.g. common phrases
  – limited (like lists of common errors)
• Parsing and rule-based
  – e.g. subject-verb agreement
  – useful for syntax but limited application
• Corpus-based probabilistic analysis
  – lexically based on co-occurrence of words
  – very local errors only
4
Conducting a corpus-based probabilistic analysis
• Construct a large corpus (100 million words)
• For most common 6,700 words, identify all possible bigrams (44 million)
• Calculate z-scores of bigrams to identify errors
• 40 million bigram errors
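The bigram scoring step might be sketched as follows. The slides do not say which z-score formula was used, so this assumes a common collocation-style z-score, z = (O - E) / sqrt(E), where O is the observed bigram count and E is the count expected if the two words occurred independently. The function name `bigram_z_scores` is hypothetical.

```python
import math
from collections import Counter


def bigram_z_scores(tokens):
    """Return {(w1, w2): z} for every ordered pair of vocabulary words.

    Assumption: z = (observed - expected) / sqrt(expected), with
    expected = count(w1) * count(w2) / corpus_size. The slides do not
    state the formula actually used.
    """
    total = len(tokens)
    unigrams = Counter(tokens)
    observed = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
    scores = {}
    for w1 in unigrams:
        for w2 in unigrams:
            expected = unigrams[w1] * unigrams[w2] / total
            scores[(w1, w2)] = (observed[(w1, w2)] - expected) / math.sqrt(expected)
    return scores
```

Under this assumption, a bigram that occurs far less often than chance predicts gets a negative score and could be flagged as a likely error. The cut-off for flagging is not given in the slides and would have to be chosen separately.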
5
The problem
• Identifying errors is relatively easy
• Providing good suggestions for correcting errors is more difficult
• Is it possible to provide correct suggestions for word-word co-occurrence errors through analysis of a large corpus?
6
The approach
• Collect 200 sentences from student writing containing word-word errors
• Generate multiple methods of correcting the errors
• Evaluate the methods
• Produce algorithms based on common patterns
7
An example
• He drives a red colour car.
  – A. Delete “red”?
  – B. Delete “colour”?
  – C. Switch “red” and “colour”?
  – D. Replace “red” with another word?
  – E. Replace “colour” with another word?
  – F. Insert a word between “red” and “colour”?
8
Checking deleting and switching
• He drives a red colour car.
  – A. Delete “red”
  – Result: He drives a colour car.
  – Check z-score of co-occurrence of a + colour + car
  – If z-score is high, possible method
  – Do the same for:
    B. Delete “colour”
    C. Switch “red” and “colour”
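Methods A to C can be sketched as generating each edited sentence together with the local word window whose co-occurrence score needs checking. This is only an illustration: the names `local_candidates` and `viable_methods`, the `score` callable, and the `threshold` parameter are all assumptions, since the slides do not say how "high" a z-score must be.

```python
def local_candidates(tokens, i):
    """Candidates A-C for an error at the word pair tokens[i], tokens[i + 1].

    Returns {method: (new_tokens, window)}, where window is the n-gram
    around the edit whose co-occurrence score should be checked.
    """
    left = tokens[i - 1:i] if i > 0 else []  # word before the pair, if any
    right = tokens[i + 2:i + 3]              # word after the pair, if any
    w1, w2 = tokens[i], tokens[i + 1]
    edits = {
        "A_delete_first": [w2],
        "B_delete_second": [w1],
        "C_switch": [w2, w1],
    }
    out = {}
    for name, middle in edits.items():
        new_tokens = tokens[:i] + middle + tokens[i + 2:]
        out[name] = (new_tokens, tuple(left + middle + right))
    return out


def viable_methods(tokens, i, score, threshold):
    """Keep the methods whose window scores at or above a chosen threshold.

    `score` is any callable mapping an n-gram tuple to a z-score;
    `threshold` is a hypothetical cut-off, not a value from the slides.
    """
    candidates = local_candidates(tokens, i)
    return [name for name, (_, window) in candidates.items()
            if score(window) >= threshold]
```

For the example sentence with the error pair at index 3, method A yields the window `("a", "colour", "car")`, which matches the a + colour + car check on the slide.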
9
Finding words to replace or insert
• He drives a red colour car.
  – D. Replace “red” with another word
  – He drives a red colour car.
  – Search for trigram: a X colour
  – Identify trigram with highest z-score for: a + X + colour
  – Do the same for:
    E. Replace “colour” with another word [red + X + car]
    F. Insert a word between “red” and “colour” [red + X + colour]
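The trigram search for methods D to F could look roughly like this. The sketch assumes the trigram z-scores are already available as a dictionary keyed by word triples. That table layout, and the names `slot_patterns` and `best_filler`, are hypothetical rather than taken from the paper.

```python
def slot_patterns(tokens, i):
    """(left, right) contexts for methods D-F at the pair tokens[i], tokens[i + 1]."""
    w1, w2 = tokens[i], tokens[i + 1]
    patterns = {"F_insert_between": (w1, w2)}      # red + X + colour
    if i > 0:
        patterns["D_replace_first"] = (tokens[i - 1], w2)   # a + X + colour
    if i + 2 < len(tokens):
        patterns["E_replace_second"] = (w1, tokens[i + 2])  # red + X + car
    return patterns


def best_filler(trigram_z, left, right, exclude=()):
    """Return (X, z) with the highest z-score for left + X + right, or None.

    `trigram_z` maps (w1, w2, w3) tuples to z-scores (assumed layout).
    `exclude` lists words that must not be suggested, e.g. the word
    being replaced.
    """
    best = None
    for (w1, x, w3), z in trigram_z.items():
        if w1 == left and w3 == right and x not in exclude:
            if best is None or z > best[1]:
                best = (x, z)
    return best
```

A linear scan over the whole table is used here only for clarity. With 6,700 words, an index keyed on the (left, right) pair would be the more practical design.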
10
Evaluating methods and producing algorithms
• For each error, up to 6 methods of generating suggestions possible
• Evaluations based on judgments of appropriacy of suggestion by a native speaker
• Patterns identified for parts of speech (there are 12,000 POS-POS-POS trigrams but 300 billion word-word-word trigrams)
• 8 algorithms produced
• Sample algorithm:
  – Replace first word (i.e. method D) when the second word is (noun OR verb OR preposition) and first word is adjective preceded by adverb
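The sample algorithm reads naturally as a condition on part-of-speech tags around the error pair. The sketch below is one way to express it. The tag labels (`"NOUN"`, `"VERB"`, `"PREP"`, `"ADJ"`, `"ADV"`) and the function name are assumptions, as the slides do not name the tagset used.

```python
def sample_rule_applies(pos_tags, i):
    """True if the sample algorithm would pick method D for the pair at i, i + 1.

    pos_tags holds one tag per word; the tag labels are hypothetical.
    """
    if i < 1 or i + 1 >= len(pos_tags):
        return False  # need a word before the pair and a second word
    return (
        pos_tags[i + 1] in {"NOUN", "VERB", "PREP"}  # second word
        and pos_tags[i] == "ADJ"                     # first word
        and pos_tags[i - 1] == "ADV"                 # word preceding the pair
    )
```

Working at the tag level keeps the rule set small: the slide's figures correspond to roughly 12,000 tag trigrams against 6,700 cubed, or about 300 billion, word trigrams.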
11
Validation of algorithms
• Procedures applied to further sentences from student writing
• Applying algorithms provides correct suggestions for 45% of errors identified
  – Pattern matching and rule-based algorithms provide correct suggestions for 90% of errors
  – Corpus-based sections cover a greater number of less predictable errors
12
Implications for lexicography
• Growth in use of electronic dictionaries
• Growth in number of aspects covered by dictionaries
  – originally only spelling and meaning
  – now examples of use, syntactic patterns, register, variants, synonyms etc.
  – in the future suggestions for correcting errors?
• In 20 years’ time, integration of dictionaries and grammar checkers?