Detecting collocation errors in English Language Learners’ writing
  Yoko Futagi
  Educational Testing Service
  ECOLT, October 29, 2010
Confidential and Proprietary. Copyright © 2010 Educational Testing Service. All rights reserved.
Outline
  Motivation and goal
  System description
  Summary
What are collocations?
  “A sequence of words or terms which co-occur more often than would be expected by chance” (Wikipedia)
  Examples:
    – hold a meeting, not clench a meeting (note: clench teeth)
    – powerful computer rather than strong computer (note: strong tea rather than powerful tea)
  Our working definition:
    – Combinations of words that are frequent “enough”
Why are collocations important?
  Collocation affects fluency; poor use of collocation can disrupt communication (Howarth 1998, Martynska 2004, Wray and Perkins 2000, etc.).
  Goal: development of a tool that detects collocation errors and suggests alternatives (“collocation tool”)
Approaches (1)
  Manual construction: dictionaries
  Corpus-based
    – Manual tagging of errors
    – Aligned bilingual corpora (machine learning)
    – Large-scale corpora
Approaches (2)
  Wible et al. (2002)
    – Constructed a learner corpus that was partially manually annotated
  Chan & Liou (2005)
    – System to teach English collocation using aligned bilingual corpora (TotalRecall™)
  Microsoft U.S. patent
    – Uses aligned corpora to construct a miscollocation database
  Seretan et al. (2005)
    – Uses the web to look up collocates of a given word
  Lin (1998)
    – Uses a broad-coverage parser and a mutual information statistic; online collocation checker
  Shei & Pain (2000)
    – Focuses on the V-N pattern
    – Similar strategy to our system, but involves manual checking of errors
Tool Design
  Student essay
  → Candidate string extraction (POS-tagger, pattern match)
  → Spell-checking: correctly spelled?
      No: omit string
      Yes: continue
  → Inflection and article variation
  → Find synonyms
  → Reference database lookup
  → Decision-making algorithm
  → OK/ERROR
  Note: the spell-checking and inflection/article variation steps are there to compensate for common learner errors.
Common learner errors
  Misspellings and inflection/article errors
    – They can degrade the performance of the part-of-speech (“POS”) tagger (software that goes through a text and tags each word with an appropriate part of speech, such as “singular noun” or “past-tense verb”)
    – N-grams (strings of words) containing them, especially misspellings, can’t be found in a database built from well-formed texts (this can cause false error flags)
  Misspelled strings are currently omitted.
Find collocation candidates: Target patterns
  Adjective-Noun: strong tea, powerful computer
  Noun-Noun: bee hive, house arrest
  Noun-of-Noun: a swarm of bees
  Verb-Object Noun: throw a party, reject an appeal
  Verb-Adverb/Adverb-Verb: argue strenuously
  Phrasal verb: turn off the light, grow up
Find collocation candidates: The process
  Part-of-speech (“POS”) tagger: software that goes through a text and tags each word with an appropriate part of speech, such as “singular noun” or “past-tense verb”
  Example:
    I used to have a very strict sckedule. I used to separate the leiser time and work time. I would like to live with a person who also likes the same things.
  After POS tagging:
    I_PRP used_VBD to_TO have_VB a_DT very_RB strict_JJ sckedule_NN ._.
    I_PRP used_VBD to_TO separate_VB the_DT leiser_NN time_NN and_CC work_NN time_NN ._.
    I_PRP would_MD like_VB to_TO live_VB with_IN a_DT person_NN who_WP also_RB likes_VBZ the_DT same_JJ things_NNS ._.
Find collocation candidates: The process, cont’d
  The tool then scans the text and looks for sequences of part-of-speech tags that match the target syntactic patterns.
    I_PRP used_VBD to_TO have_VB a_DT very_RB strict_JJ sckedule_NN ._.
    I_PRP used_VBD to_TO separate_VB the_DT leiser_NN time_NN and_CC work_NN time_NN ._.
    I_PRP would_MD like_VB to_TO live_VB with_IN a_DT person_NN who_WP also_RB likes_VBZ the_DT same_JJ things_NNS ._.
Find collocation candidates: The process, cont’d
  While scanning the text, the tool ignores words that typically do not participate in collocations (numbers, pronouns, thing, also, always, etc.) so that it does not pick up false candidates.
  Candidates found:
    have_VB a_DT very_RB strict_JJ sckedule_NN
    strict_JJ sckedule_NN
    separate_VB the_DT leiser_NN time_NN
    leiser_NN time_NN
    work_NN time_NN
  Ignored (contain thing or also):
    likes_VBZ the_DT same_JJ things_NNS
    also_RB likes_VBZ
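The scan described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names, the ignore list, the skipped tags, and the restriction to three of the target patterns are all assumptions made for the example. It works directly on the `word_TAG` strings shown on the slides.

```python
# Illustrative sketch only: IGNORE_WORDS, SKIP_TAGS and the three patterns
# handled below are assumptions, not the tool's actual resources.
IGNORE_WORDS = {"thing", "things", "also", "always"}
SKIP_TAGS = {"PRP", "CD"}          # pronouns and numbers
MODIFIER_TAGS = {"DT", "RB", "JJ"}  # may sit between a verb and its object


def parse_tagged(text):
    """Turn 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


def usable(word, tag):
    """False for words that typically do not take part in collocations."""
    return word.lower() not in IGNORE_WORDS and tag not in SKIP_TAGS


def find_candidates(tagged):
    """Return (pattern name, words) for every match in one tagged sentence."""
    found = []
    n = len(tagged)
    for i, (word, tag) in enumerate(tagged):
        if not usable(word, tag):
            continue
        # Adjective-Noun and Noun-Noun: two adjacent tokens.
        if i + 1 < n:
            nxt_word, nxt_tag = tagged[i + 1]
            if nxt_tag.startswith("NN") and usable(nxt_word, nxt_tag):
                if tag.startswith("JJ"):
                    found.append(("Adjective-Noun", [word, nxt_word]))
                elif tag.startswith("NN"):
                    found.append(("Noun-Noun", [word, nxt_word]))
        # Verb-Object: verb, optional determiner/adverb/adjective, noun(s).
        if tag.startswith("VB"):
            j = i + 1
            while j < n and tagged[j][1] in MODIFIER_TAGS:
                j += 1
            if j < n and tagged[j][1].startswith("NN"):
                # Extend over a compound noun such as "leiser time".
                while j + 1 < n and tagged[j + 1][1].startswith("NN"):
                    j += 1
                span = tagged[i:j + 1]
                if all(usable(w, t) for w, t in span):
                    found.append(("Verb-Object", [w for w, _ in span]))
    return found


sentence = ("I_PRP used_VBD to_TO separate_VB the_DT leiser_NN time_NN "
            "and_CC work_NN time_NN ._.")
for pattern, words in find_candidates(parse_tagged(sentence)):
    print(pattern, " ".join(words))
```

A span containing an ignored word is rejected as a whole, which is why the third example sentence yields no candidates in this sketch.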
Spell Checking
  Misspelled candidate strings are thrown out, since they would not be found in the database.
  Thrown out (contain the misspellings sckedule or leiser):
    have_VB a_DT very_RB strict_JJ sckedule_NN
    strict_JJ sckedule_NN
    separate_VB the_DT leiser_NN time_NN
    leiser_NN time_NN
  Kept:
    work_NN time_NN
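As a rough sketch of this filtering step, the snippet below drops any candidate that contains a word missing from a word list. The tiny `KNOWN_WORDS` set stands in for a real spell-checker and is purely an assumption for the example, as are the function names.

```python
# Stand-in lexicon for illustration; a real system would use a full
# spell-checker or dictionary. This set is an assumption.
KNOWN_WORDS = {
    "have", "a", "very", "strict", "schedule",
    "separate", "the", "leisure", "work", "time",
}


def correctly_spelled(candidate, lexicon=KNOWN_WORDS):
    """True only if every word of the candidate string is in the lexicon."""
    return all(word.lower() in lexicon for word in candidate)


def drop_misspelled(candidates, lexicon=KNOWN_WORDS):
    """Keep only the candidates that contain no misspelled word."""
    return [c for c in candidates if correctly_spelled(c, lexicon)]


candidates = [
    ["have", "a", "very", "strict", "sckedule"],
    ["strict", "sckedule"],
    ["separate", "the", "leiser", "time"],
    ["leiser", "time"],
    ["work", "time"],
]
kept = drop_misspelled(candidates)
print(kept)
```

Note that a misspelling that happens to be a real word (by spelled as buy) passes this kind of check unnoticed.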
Generate morphology and article variations
  If there is an article or inflection error, the string would not be found; this is especially problematic for non-native speakers’ writing.
  Solution: create variants of the original string by:
    – varying the article (a/an/the/Ø for singular, the/Ø for plural):
        original: leaving city → leaving a city, leaving the city
    – varying verb or noun inflection:
        original: pick apples → picked apples, picks apples, picking apples, etc.
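A minimal sketch of variant generation for a verb-noun string is given below. The helper names are hypothetical, and the inflection rules are deliberately naive (regular verbs only, no consonant doubling, no irregular forms, a/an chosen by first letter); a real system would presumably rely on a morphological generator.

```python
def article_variants(verb, noun, plural=False):
    """Vary the article: Ø/a(n)/the for singular nouns, Ø/the for plural."""
    if plural:
        articles = ["", "the"]
    else:
        # Crude a/an choice by first letter; an assumption for the example.
        a_or_an = "an" if noun[0].lower() in "aeiou" else "a"
        articles = ["", a_or_an, "the"]
    return [" ".join(p for p in (verb, art, noun) if p) for art in articles]


def verb_forms(base):
    """Naive regular-verb inflection: base, -s, -ed, -ing."""
    if base.endswith("e"):
        return [base, base + "s", base + "d", base[:-1] + "ing"]
    return [base, base + "s", base + "ed", base + "ing"]


def inflection_variants(verb, noun):
    """Vary the verb inflection while keeping the noun fixed."""
    return [f"{form} {noun}" for form in verb_forms(verb)]


print(article_variants("leaving", "city"))
print(inflection_variants("pick", "apples"))
```

Each variant would then be looked up in the reference database alongside the original string.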
Find synonyms
  Powerful and strong are synonyms.
  and...
  Strong tea occurs much more frequently than powerful tea
  ↓
  We can say with confidence that strong tea is a collocation.
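The reasoning on this slide can be illustrated with a small sketch: swap in each synonym, compare how often the alternatives occur, and suggest the alternative only when it is far more frequent. The synonym table, the counts (invented toy numbers, not corpus data), and the `ratio` threshold are all assumptions; the slides do not describe the actual decision-making algorithm in this detail.

```python
# Toy resources for illustration only: the counts are made up and the
# synonym table is a stand-in for a real thesaurus.
SYNONYMS = {"powerful": ["strong"], "strong": ["powerful"]}
TOY_COUNTS = {
    ("strong", "tea"): 120,
    ("powerful", "tea"): 2,
    ("powerful", "computer"): 90,
    ("strong", "computer"): 3,
}


def alternatives(modifier, noun):
    """Strings obtained by replacing the modifier with each synonym."""
    return [(syn, noun) for syn in SYNONYMS.get(modifier, [])]


def best_alternative(modifier, noun, ratio=10):
    """Return a synonym variant that is at least `ratio` times more
    frequent than the original string, or None if there is none."""
    original = TOY_COUNTS.get((modifier, noun), 0)
    best = None
    for alt in alternatives(modifier, noun):
        count = TOY_COUNTS.get(alt, 0)
        if count >= ratio * max(original, 1):
            if best is None or count > TOY_COUNTS.get(best, 0):
                best = alt
    return best


print(best_alternative("powerful", "tea"))
print(best_alternative("strong", "tea"))
```

With these toy numbers, powerful tea gets strong tea as a suggestion, while strong tea has no better alternative and would be left alone.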
Look up in the reference database
  The larger the corpora the database is created from, the better the coverage.
  Generally speaking, raw counts are not used to measure the collocational “strength” of a string.
    → This is because some words are simply more common than others.
  Instead, word-association statistics are used.
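To see why raw counts can mislead, here is a sketch using pointwise mutual information (PMI), one common word-association statistic. The slides do not say which statistic the tool uses, so PMI is an assumption here, and all counts below are invented toy numbers.

```python
import math


def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information: log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    p_pair = pair_count / total
    p_w1 = w1_count / total
    p_w2 = w2_count / total
    return math.log2(p_pair / (p_w1 * p_w2))


TOTAL = 1_000_000  # toy corpus size (assumption)

# "the time": frequent pair, but both words are very common on their own.
pmi_the_time = pmi(pair_count=300, w1_count=60_000, w2_count=2_000, total=TOTAL)

# "strong tea": rarer pair, but far more frequent than chance would predict.
pmi_strong_tea = pmi(pair_count=40, w1_count=500, w2_count=200, total=TOTAL)

print(round(pmi_the_time, 2), round(pmi_strong_tea, 2))
```

In this toy setting the pair with the higher raw count receives the lower association score, because its frequency is largely explained by how common its individual words are.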
Evaluation
  Procedure
    – 2000+ candidate strings were extracted from 300 randomly selected TOEFL essays, each with an OK or ERROR decision by the collocation tool
    – 2 native speakers (“annotators”) were asked to mark each string with one of the following judgments:
        3 = “This sounds natural”
        2 = “Not so good, but not impossible”
        1 = “This is really unnatural; I would never say it like this”
    – Annotators agreed on 1020 strings.
Agreement between raters and collocation tool
  Table (cell values not recoverable): precision and recall, for the ERROR judgment and the OK judgment
  Precision and recall explanation (shown for the OK judgment):
    $$\text{Precision} = \frac{|\,\text{OK recognized by the annotator} \cap \text{OK detected by the tool}\,|}{|\,\text{OK detected by the tool}\,|}$$
    $$\text{Recall} = \frac{|\,\text{OK recognized by the annotator} \cap \text{OK detected by the tool}\,|}{|\,\text{OK recognized by the annotator}\,|}$$
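The two definitions can be computed directly from paired judgments, as in the sketch below. The label lists are made-up examples, not the evaluation data, and the function name is hypothetical; how the annotators' 1 to 3 ratings were mapped to OK and ERROR is not stated on the slide, so the sketch simply takes OK/ERROR labels as given.

```python
def precision_recall(annotator_labels, tool_labels, target="OK"):
    """Precision and recall of the tool for one judgment (OK or ERROR)."""
    pairs = list(zip(annotator_labels, tool_labels))
    both = sum(1 for a, t in pairs if a == target and t == target)
    by_tool = sum(1 for _, t in pairs if t == target)
    by_annotator = sum(1 for a, _ in pairs if a == target)
    precision = both / by_tool if by_tool else 0.0
    recall = both / by_annotator if by_annotator else 0.0
    return precision, recall


# Made-up judgments for five strings, for illustration only.
annotators = ["OK", "OK", "ERROR", "OK", "ERROR"]
tool = ["OK", "ERROR", "ERROR", "OK", "OK"]

print(precision_recall(annotators, tool, target="OK"))
print(precision_recall(annotators, tool, target="ERROR"))
```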
Sources of tool errors
  Tool errors:
    – POS tagger errors
    – Pattern-matching errors
    – Reference database coverage
  Writer errors:
    – Misspellings resulting in real words (e.g. by spelled as buy)
    – Grammar errors
    – Incomprehensible “sentences”
Future plans
  Decrease tool errors
    – Try a more robust POS-tagger
    – Try using a syntactic parser
    – Increase database size
  Improve tool speed
    – Eliminate the need for generating inflection and article variations → build a special database
  Improve correction suggestions
Acknowledgments
  Martin Chodorow, Paul Deane, Sarah Ohls, Joel Tetreault, Cathy Trapani, Vincent Weng, Waverely vanWinkle
Thank you.