Detecting collocation errors in English Language Learners’ writing Yoko Futagi Educational Testing Service ECOLT October 29, 2010.



2 Confidential and Proprietary. Copyright © 2010 Educational Testing Service. All rights reserved.
Outline
– Motivation and goal
– System description
– Summary

3 What are collocations?
“A sequence of words or terms which co-occur more often than would be expected by chance” (Wikipedia)
Examples:
– hold a meeting, not clench a meeting (note: clench teeth)
– powerful computer rather than strong computer (note: strong tea rather than powerful tea)
Our working definition:
– Combinations of words that are frequent “enough”

4 Why are collocations important?
Collocation affects fluency; poor use of collocation can disrupt communication (Howarth 1998, Martynska 2004, Wray and Perkins 2000, etc.).
Goal: development of a tool that detects collocation errors and suggests alternatives (“collocation tool”)

5 Approaches (1)
Manual construction: dictionaries
Corpus-based:
– Manual tagging of errors
– Aligned bilingual corpora (machine learning)
– Large-scale corpora

6 Approaches (2)
Wible et al. (2002)
– Constructed a learner corpus that was partially manually annotated
Chan & Liou (2005)
– System to teach English collocations using aligned bilingual corpora (TotalRecall™)
Microsoft U.S. patent
– Uses aligned corpora to construct a miscollocation database
Seretan et al. (2005)
– Uses the web to look up collocates of a given word
Lin (1998)
– Uses a broad-coverage parser and the mutual information statistic; online collocation checker
Shei & Pain (2000)
– Focuses on the V-N pattern
– Similar strategy to our system, but involves manual checking of errors

7 Tool Design
Student essay → candidate string extraction (POS tagger, pattern match) → spell-checking → correctly spelled? (no: omit string) → inflection and article variation (to compensate for common learner errors) → find synonyms → reference database lookup → decision-making algorithm → OK/ERROR

8 Common learner errors: misspellings and inflection/article errors
– They can degrade the performance of the part-of-speech (“POS”) tagger (software that goes through a text and tags each word with an appropriate part of speech, such as “singular noun” or “past-tense verb”)
– N-grams (strings of words) containing them, especially misspellings, cannot be found in a database built from well-formed texts (this can cause false error-flagging)
→ Misspelled strings are currently omitted.

9 Tool Design (pipeline diagram repeated)

10 Find collocation candidates: target patterns
– Adjective-Noun: strong tea, powerful computer
– Noun-Noun: bee hive, house arrest
– Noun-of-Noun: a swarm of bees
– Verb-Object Noun: throw a party, reject an appeal
– Verb-Adverb / Adverb-Verb: argue strenuously
– Phrasal verb: turn off the light, grow up

11 Find collocation candidates: the process
First, run the part-of-speech (“POS”) tagger, which goes through the text and tags each word with an appropriate part of speech, such as “singular noun” or “past-tense verb”.
Example: I used to have a very strict sckedule. I used to separate the leiser time and work time. I would like to live with a person who also likes the same things.
After POS tagging: I_PRP used_VBD to_TO have_VB a_DT very_RB strict_JJ sckedule_NN ._. I_PRP used_VBD to_TO separate_VB the_DT leiser_NN time_NN and_CC work_NN time_NN ._. I_PRP would_MD like_VB to_TO live_VB with_IN a_DT person_NN who_WP also_RB likes_VBZ the_DT same_JJ things_NNS ._.

12 Find collocation candidates: the process, cont’d
The tool then scans the text and looks for sequences of parts of speech that match the target syntactic patterns:
I_PRP used_VBD to_TO have_VB a_DT very_RB strict_JJ sckedule_NN ._.
I_PRP used_VBD to_TO separate_VB the_DT leiser_NN time_NN and_CC work_NN time_NN ._.
I_PRP would_MD like_VB to_TO live_VB with_IN a_DT person_NN who_WP also_RB likes_VBZ the_DT same_JJ things_NNS ._.

13 Find collocation candidates: the process, cont’d
While scanning the text, words that typically do not participate in collocations (such as numbers, pronouns, thing, also, always, etc.) are ignored to prevent picking up false candidates:
have_VB a_DT very_RB strict_JJ sckedule_NN
strict_JJ sckedule_NN
separate_VB the_DT leiser_NN time_NN
leiser_NN time_NN
work_NN time_NN
likes_VBZ the_DT same_JJ things_NNS
also_RB likes_VBZ
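The extraction step described on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the tag patterns, the stoplist, and the decision to match only adjacent word pairs are simplifying assumptions.

```python
# Sketch of candidate extraction: scan POS-tagged tokens for target
# patterns (here just ADJ-N, N-N, and V-Obj-N) while skipping words
# that rarely participate in collocations. Illustrative only.

STOPWORDS = {"thing", "things", "also", "always", "same"}

# Each pattern is a pair of coarse POS classes.
PATTERNS = [
    ("JJ", "NN"),   # adjective-noun: strict schedule
    ("NN", "NN"),   # noun-noun: work time
    ("VB", "NN"),   # verb-object noun: throw (a) party
]

def coarse(tag):
    """Collapse Penn Treebank tags (VBD, NNS, ...) to a coarse class."""
    for prefix in ("NN", "VB", "JJ"):
        if tag.startswith(prefix):
            return prefix
    return tag

def extract_candidates(tagged):
    """Return collocation candidates from a list of (word, tag) pairs."""
    # Drop determiners and adverbs, plus stoplisted words, so that
    # 'a very strict sckedule' still yields the pair (strict, sckedule).
    content = [(w, coarse(t)) for w, t in tagged
               if t not in ("DT", "RB") and w.lower() not in STOPWORDS]
    candidates = []
    for i in range(len(content) - 1):
        if (content[i][1], content[i + 1][1]) in PATTERNS:
            candidates.append((content[i][0], content[i + 1][0]))
    return candidates

tagged = [("I", "PRP"), ("used", "VBD"), ("to", "TO"), ("have", "VB"),
          ("a", "DT"), ("very", "RB"), ("strict", "JJ"), ("sckedule", "NN")]
print(extract_candidates(tagged))  # → [('strict', 'sckedule')]
```

Note that the learner misspelling sckedule is still extracted here; the spell-checking stage on the next slides is what removes such strings.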

14 Tool Design (pipeline diagram repeated)

15 Spell checking
Misspelled candidate strings are thrown out, since they would not be found in the database:
have_VB a_DT very_RB strict_JJ sckedule_NN
strict_JJ sckedule_NN
separate_VB the_DT leiser_NN time_NN
leiser_NN time_NN
work_NN time_NN

16 Tool Design (pipeline diagram repeated)

17 Generate morphology and article variations
If there is an article or inflection error, the string would not be found in the database; this is especially problematic for non-native speakers’ writing.
Solution: create variants of the original string by:
– varying the article (a/an/the/Ø for singular, the/Ø for plural):
original: leaving city → leaving a city, leaving the city
– varying verb or noun inflection:
original: pick apples → picked apples, picks apples, picking apples, etc.
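The variant-generation step above can be sketched as follows. This is a toy version under stated assumptions: real morphological generation needs a lexicon for irregular forms, whereas the suffix rules here handle only regular verbs.

```python
# Sketch of article/inflection variant generation: given a verb-noun
# candidate, enumerate variants so a learner's 'pick apple' can still
# match 'picks apples', 'picked the apples', etc. in the database.

def article_variants(noun_phrase, plural=False):
    """Vary the article: a/an/the/none for singular, the/none for plural."""
    articles = ["", "the"] if plural else ["", "a", "an", "the"]
    return [f"{art} {noun_phrase}".strip() for art in articles]

def verb_variants(verb):
    """Naive inflection variants for a regular base-form verb."""
    return [verb, verb + "s", verb + "ed", verb + "ing"]

def generate_variants(verb, noun, plural=False):
    """All article x inflection variants of a verb-noun candidate."""
    return [f"{v} {np}" for v in verb_variants(verb)
            for np in article_variants(noun, plural)]

print(generate_variants("pick", "apples", plural=True)[:4])
```

Every variant is then looked up in the reference database; a hit on any variant means the underlying word combination exists, even if the learner's exact surface form does not.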

18 Tool Design (pipeline diagram repeated)

19 Find synonyms
Powerful and strong are synonyms, and strong tea occurs much more frequently than powerful tea.
↓
We can say with confidence that strong tea is a collocation.
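The comparison logic above can be sketched as a frequency check against synonym substitutes. The synonym table, counts, and the threshold ratio below are made-up illustrative values, not the tool's real data or decision rule.

```python
# Sketch of the synonym-comparison idea: if a candidate ('powerful tea')
# is rare while a synonym substitute ('strong tea') is frequent, flag
# the candidate as a likely miscollocation and suggest the frequent form.

SYNONYMS = {"powerful": ["strong"], "strong": ["powerful"]}
COUNTS = {("strong", "tea"): 900, ("powerful", "tea"): 3,
          ("powerful", "computer"): 700, ("strong", "computer"): 40}

def check_collocation(modifier, noun, ratio=10):
    """Return (verdict, suggestion): ERROR if some synonym variant is
    at least `ratio` times more frequent than the observed pair."""
    observed = COUNTS.get((modifier, noun), 0)
    for syn in SYNONYMS.get(modifier, []):
        alt = COUNTS.get((syn, noun), 0)
        if alt > observed * ratio:
            return ("ERROR", f"{syn} {noun}")
    return ("OK", None)

print(check_collocation("powerful", "tea"))       # → ('ERROR', 'strong tea')
print(check_collocation("powerful", "computer"))  # → ('OK', None)
```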

20 Tool Design (pipeline diagram repeated)

21 Look up in the reference database
The larger the corpora the database is created from, the better the coverage.
Generally speaking, raw counts are not used to measure the collocational “strength” of a string, because some words are simply more common than others. Instead, word-association statistics are used.
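The slides do not name the word-association statistic used, but a common choice in collocation work (e.g. Lin 1998, cited earlier) is pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), which is high when a pair co-occurs far more often than its individual word frequencies predict. A sketch with made-up counts:

```python
# PMI from raw corpus counts. High PMI: the pair co-occurs far more
# than chance; raw counts alone would mislead, since frequent words
# co-occur with everything. The counts below are hypothetical.
import math

def pmi(pair_count, x_count, y_count, total_pairs):
    """Pointwise mutual information of a word pair."""
    p_xy = pair_count / total_pairs
    p_x = x_count / total_pairs
    p_y = y_count / total_pairs
    return math.log2(p_xy / (p_x * p_y))

# 'strong tea' vs 'powerful tea' in a hypothetical 1M-bigram corpus:
print(round(pmi(800, 20_000, 5_000, 1_000_000), 2))  # strong + tea → 3.0
print(round(pmi(2, 15_000, 5_000, 1_000_000), 2))    # powerful + tea (negative)
```

The first pair scores well above zero (attracted), the second well below (occurring less often than chance), which is the kind of contrast the decision-making algorithm can threshold on.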

22 Evaluation
Procedure:
– 2000+ candidate strings extracted from 300 randomly selected TOEFL essays, with an OK or ERROR decision by the collocation tool
– 2 native speakers (“annotators”) were asked to mark each string with one of the following judgments:
3 = “This sounds natural”
2 = “Not so good, but not impossible”
1 = “This is really unnatural; I would never say it like this”
– Annotators agreed on 1020 strings.

23 Agreement between raters and collocation tool
(table of precision and recall for the ERROR and OK judgments; values not recoverable from this transcript)
Precision = |OK recognized by the annotator ∩ OK detected by the tool| / |OK detected by the tool|
Recall = |OK recognized by the annotator ∩ OK detected by the tool| / |OK recognized by the annotator|
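The formulas above can be computed over parallel lists of annotator and tool judgments. The labels below are made-up examples, not the study's actual annotations or results.

```python
# Precision/recall of a given label (OK or ERROR) for parallel
# gold (annotator) and predicted (tool) judgment lists.

def precision_recall(gold, predicted, label="OK"):
    """Precision and recall of `label` over aligned judgment lists."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    pred_n = sum(1 for p in predicted if p == label)
    gold_n = sum(1 for g in gold if g == label)
    precision = tp / pred_n if pred_n else 0.0
    recall = tp / gold_n if gold_n else 0.0
    return precision, recall

gold      = ["OK", "OK", "ERROR", "OK", "ERROR", "OK"]
predicted = ["OK", "ERROR", "ERROR", "OK", "OK", "OK"]
print(precision_recall(gold, predicted, "OK"))  # → (0.75, 0.75)
```

Computing the same statistics with label="ERROR" gives the tool's error-detection precision and recall, which is the pair most relevant for a grammar checker (false ERROR flags annoy the writer).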

24 Sources of tool errors
Tool errors:
– POS tagger errors
– Pattern-matching errors
– Reference database coverage
Writer errors:
– Misspellings resulting in real words (e.g., by spelled as buy)
– Grammar errors
– Incomprehensible “sentences”

25 Future plans
Decrease tool errors:
– Try a more robust POS tagger
– Try using a syntactic parser
– Increase database size
Improve tool speed:
– Eliminate the need for generating inflection and article variations by building a special database
Improve correction suggestions

26 Acknowledgments
Martin Chodorow, Paul Deane, Sarah Ohls, Joel Tetreault, Cathy Trapani, Vincent Weng, Waverely vanWinkle

27 Thank you.