The Ups and Downs of Preposition Error Detection in ESL Writing
Joel Tetreault [Educational Testing Service], Martin Chodorow [Hunter College of CUNY]


Motivation
- Increasing need for tools for instruction in English as a Second Language (ESL)
- Preposition usage is one of the most difficult aspects of English for non-native speakers:
  - [Dalgish '85]: 18% of sentences from ESL essays contain a preposition error
  - Our data: 8-10% of all prepositions in TOEFL essays are used incorrectly

Why are prepositions hard to master?
Prepositions perform many complex roles:
- Preposition choice in an adjunct is constrained by its object ("on Friday", "at noon")
- Prepositions mark the arguments of a predicate ("fond of beer")
- Phrasal verbs ("give in to their demands"): "give in" = "acquiesce, surrender"
- Multiple prepositions can appear in the same context: "...the force of gravity causes the sap to move _____ the underside of the stem." [to, onto, toward, on]

Objective
Long-term goal: develop NLP tools to automatically provide feedback to ESL learners on grammatical errors.
Preposition error detection targets three error types:
- Selection error ("They arrived to the town.")
- Extraneous use ("They came to outside.")
- Omission ("He is fond this book.")
Coverage: the 34 most frequent prepositions

Outline
- Approach
  - Observation 1: Classifier prediction
  - Observation 2: Training a model
  - Observation 3: What features are important?
- Evaluation on native text
- Evaluation on ESL text

Observation 1: Classification Problem
Cast the error detection task as a classification problem. Given a trained classifier and a context:
- The system outputs a probability distribution over all prepositions
- Compare the weight of the system's top preposition with that of the writer's preposition
An error occurs when:
- The writer's preposition ≠ the classifier's prediction, and
- The difference in probabilities exceeds a threshold
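
As a rough illustration, here is a minimal Python sketch of this decision rule. The function name, classifier interface, and threshold value are assumptions for exposition, not the authors' implementation:

```python
def flag_preposition_error(prob_dist, writers_prep, threshold=0.7):
    """Flag an error when the classifier's top choice differs from the
    writer's preposition and the probability gap exceeds a threshold.

    prob_dist: dict mapping each candidate preposition to its probability.
    writers_prep: the preposition the writer actually used.
    threshold: minimum probability gap required to flag (assumed value).
    """
    top_prep = max(prob_dist, key=prob_dist.get)
    if top_prep == writers_prep:
        return None  # writer agrees with the classifier: no error
    gap = prob_dist[top_prep] - prob_dist.get(writers_prep, 0.0)
    if gap > threshold:
        return top_prep  # suggest the classifier's choice as a correction
    return None  # too close to call: stay silent to protect precision
```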

Observation 2: Training a Model
Option 1: develop a training set of error-annotated ESL essays (millions of examples?)
- Too labor-intensive to be practical
Alternative: train on millions of examples of proper usage
- The challenge then becomes determining how "close to correct" the writer's preposition is

Observation 3: Features
Preposition choice is influenced by:
- Words in the local context, and how they interact with each other (lexical)
- The syntactic structure of the context
- Semantic interpretation

Summary
1. Extract lexical and syntactic features from well-formed (native) text
2. Train a MaxEnt model on the feature set to output a probability distribution over 34 prepositions
3. Evaluate on an error-annotated ESL corpus by:
   - Comparing the system's preposition with the writer's preposition
   - If they differ, using thresholds to determine the "correctness" of the writer's preposition

Feature Extraction
Corpus processing:
- POS tagging (MaxEnt tagger [Ratnaparkhi '98])
- Heuristic chunker
- Parse trees? Full parsing is unreliable on noisy learner text such as: "In consion, for some reasons, museums, particuraly known travel place, get on many people."
Feature extraction:
- The context consists of a +/- two-word window plus the heads of the following NP and the preceding VP and NP
- 25 features consisting of sequences of lemma forms and POS tags
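
A minimal sketch of the windowed part of this feature extraction, assuming tokens and POS tags are already available; the feature names are illustrative, not the exact ETS feature templates:

```python
def window_features(tokens, tags, i, size=2):
    """Collect +/- `size` word and POS features around the preposition at index i."""
    feats = {}
    for offset in range(-size, size + 1):
        if offset == 0:
            continue  # skip the preposition slot itself
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word_{offset:+d}"] = tokens[j].lower()
            feats[f"pos_{offset:+d}"] = tags[j]
    return feats

# Example: features around "in" in "He will take our place in the line ."
tokens = ["He", "will", "take", "our", "place", "in", "the", "line", "."]
tags = ["PRP", "MD", "VB", "PRP$", "NN", "IN", "DT", "NN", "."]
print(window_features(tokens, tags, 5))
```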

Features
Example sentence: "He will take our place in the line."

Feature | No. of Values | Description
PV      | 16,060        | Prior verb
PN      | 23,307        | Prior noun
FH      | 29,815        | Headword of the following phrase
FP      | 57,680        | Following phrase
TGLR    | 69,833        | Middle trigram (POS + words)
TGL     | 83,658        | Left trigram
TGR     | 77,460        | Right trigram
BGL     | 30,103        | Left bigram

In the example sentence, PV = "take", PN = "place", and FH = "line".

Combination Features
- MaxEnt does not model the interactions between features
- So we build "combination" features from the head nouns and commanding verbs (PV, PN, FH)
- 3 types: word, tag, word+tag
- Each type has four possible combinations, giving a maximum of 12 combination features
(a sketch of building these follows the table on the next slide)

Combination Features
Example: "He will take our place in the line."

Class    | Components | +Combo:word
p-N      | FH         | line
N-p-N    | PN-FH      | place-line
V-p-N    | PV-PN      | take-line
V-N-p-N  | PV-PN-FH   | take-place-line
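
A minimal sketch of assembling the word-level combination features, following the +Combo:word column above (the function name and exact templates are assumptions):

```python
def combo_word_features(pv, pn, fh):
    """Build the four word-level combination classes from the head words.

    pv: commanding verb, pn: prior noun, fh: headword of the following phrase.
    """
    return {
        "p-N": fh,                     # e.g. "line"
        "N-p-N": f"{pn}-{fh}",         # e.g. "place-line"
        "V-p-N": f"{pv}-{fh}",         # e.g. "take-line"
        "V-N-p-N": f"{pv}-{pn}-{fh}",  # e.g. "take-place-line"
    }

print(combo_word_features(pv="take", pn="place", fh="line"))
```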

Google N-gram Features
- A typical way non-native speakers check whether usage is correct: "Google" the phrase and its alternatives
- Created a fast-access Oracle database from the POS-tagged Google N-gram corpus
- Queries provided frequency data for the +Combo features
- The top three prepositions per query were used as features for the MaxEnt model
- Maximum of 12 Google features
(a sketch of such a lookup follows the table on the next slide)

Google Features
Example: "He will take our place in the line."

Class    | Combo:word      | Google Features
p-N      | line            | P1=on, P2=in, P3=of
N-p-N    | place-line      | P1=in, P2=on, P3=of
V-p-N    | take-line       | P1=on, P2=to, P3=into
V-N-p-N  | take-place-line | P1=in, P2=on, P3=after
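
A minimal sketch of deriving the top-three preposition features from n-gram counts. The FAKE_COUNTS table is invented stand-in data for the Google N-gram database, and the lookup interface is an assumption:

```python
# Tiny invented stand-in for the POS-tagged Google N-gram database.
FAKE_COUNTS = {
    "place in line": 9000, "place on line": 4000, "place of line": 1500,
    "place at line": 200, "place for line": 100,
}

PREPOSITIONS = ["in", "on", "of", "at", "for"]  # subset, for illustration only

def ngram_count(phrase):
    """Stand-in for a lookup into the n-gram frequency database."""
    return FAKE_COUNTS.get(phrase, 0)

def top3_prep_features(combo_class, left, right):
    """Rank candidate prepositions by n-gram frequency between two head words."""
    ranked = sorted(PREPOSITIONS,
                    key=lambda p: ngram_count(f"{left} {p} {right}"),
                    reverse=True)[:3]
    return {f"{combo_class}_P{i + 1}": p for i, p in enumerate(ranked)}

# e.g. the N-p-N query for "place ? line":
print(top3_prep_features("N-p-N", "place", "line"))
# {'N-p-N_P1': 'in', 'N-p-N_P2': 'on', 'N-p-N_P3': 'of'}
```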

Preposition Selection Evaluation
Test models on well-formed native text. Metric: accuracy.
- Compare the system's output to the writer's choice
- Accuracy has the potential to underestimate performance by as much as 7% [HJCL '08]
Two evaluation corpora:
- WSJ: test = 106k events, train = 4.4M NANTC events
- Encarta-Reuters: test = 1.4M events, train = 3.2M events; used in [Gamon et al. '08]

Preposition Selection Evaluation

Model               | WSJ   | Enc-Reu
Baseline (of)       | 26.7% | 27.2%
Lexical             | 70.8% | 76.5%
+Combo              | 71.8% | 77.4%
+Google             | 71.6% | 76.9%
+Both               | 72.4% | 77.7%
+Combo + Extra Data | 74.1% | 79.0%

* For comparison, [Gamon et al. '08] perform at 64% accuracy on 12 prepositions.

Evaluation on Non-Native Texts
Error annotation:
- Most previous work used only one rater; is one rater reliable? [HJCL '08]
- A sampling approach allows efficient annotation
Performance thresholding:
- How should precision and recall be balanced? One may not want to optimize a system using F-score
ESL corpora:
- Factors such as L1 and grade level greatly influence performance, which makes cross-system evaluation difficult

Related Work
Most previous work has focused on:
- A subset of prepositions
- Limited evaluation on a small test corpus

Related Work

Work                      | Method                                 | Performance
[Eeg-Olofsson et al. '03] | Handcrafted rules for Swedish learners | 11/40 prepositions correct
[Izumi et al. '03, '04]   | ME model to classify 13 error types    | 25% precision, 7% recall
[Lee & Seneff '06]        | Stochastic model on restricted domain  | 80% precision, 77% recall
[De Felice & Pulman '08]  | MaxEnt model (9 prepositions)          | ~57% precision, ~11% recall
[Gamon et al. '08]        | LM + decision trees (12 prepositions)  | 80% precision

Training Corpus for ESL Texts
- Well-formed text → training only on positive examples of usage
- 6.8 million training contexts in total (3.7 million sentences)
- Two sub-corpora:
  - MetaMetrics Lexile: 11th and 12th grade texts, 1.9M sentences
  - San Jose Mercury News: newspaper text, 1.8M sentences

ESL Testing Corpus
- A collection of randomly selected TOEFL essays written by native speakers of Chinese, Japanese, and Russian
- 8,192 prepositions in total (5,585 sentences)
- Error annotation reliability between two human raters:
  - Agreement =
  - Kappa = 0.599

Expanded Classifier
Pipeline: Data → Pre-Processing Filter → MaxEnt Classifier (uses model from training) → Post-Processing Filter → Extraneous Use Classifier (PC) → Output

Pre-Processing Filter
- Spelling errors: the classifier is blocked from considering preposition contexts that contain spelling errors
- Punctuation errors: TOEFL essays have many omitted punctuation marks, which affects feature extraction
- Trades off recall for precision

Post-Processing Filter
- Antonyms: the classifier confuses prepositions with opposite meanings (with/without, from/to); resolution depends on the writer's intention
- Benefactives: adjunct vs. argument confusion; WordNet is used to block the classifier from marking benefactives as errors

Prohibited Context Filter
- Prohibited contexts account for 142 of the 600 errors in the test set
- Two filters:
  - Plural quantifier constructions ("some of people")
  - Repeated prepositions ("can find friends with with")
- The filters cover 25% of those 142 errors
(a sketch of these two checks follows below)
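
A minimal sketch of the two prohibited-context checks, assuming tokenized and POS-tagged input; the quantifier list and rule details are illustrative assumptions, not the authors' exact rules:

```python
# Illustrative lists, not the authors' exact resources.
QUANTIFIERS = {"some", "many", "most", "few", "all"}
PREPOSITIONS = {"of", "in", "on", "to", "at", "for", "with", "from"}

def repeated_preposition(tokens):
    """Flag immediately repeated prepositions, e.g. 'with with'."""
    return any(a == b and a in PREPOSITIONS
               for a, b in zip(tokens, tokens[1:]))

def plural_quantifier_construction(tokens, tags):
    """Flag 'quantifier + of + bare plural noun', e.g. 'some of people'."""
    for i in range(len(tokens) - 2):
        if (tokens[i] in QUANTIFIERS and tokens[i + 1] == "of"
                and tags[i + 2] == "NNS"):  # bare plural right after 'of'
            return True
    return False

# Example usage:
print(repeated_preposition("can find friends with with".split()))  # True
toks = "some of people".split()
print(plural_quantifier_construction(toks, ["DT", "IN", "NNS"]))   # True
```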

Thresholding the Classifier's Output
Thresholds allow the system to skip cases where the probabilities of the top-ranked preposition and the one the student wrote differ by less than a pre-specified amount.

Thresholds
Example: "He is fond with beer" → FLAG AS ERROR (the probability gap between the top-ranked preposition and the writer's "with" exceeds the threshold)

Thresholds
Example: "My sister usually gets home around 3:00" → FLAG AS OK (the gap between the top-ranked preposition and the writer's "around" falls below the threshold)
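
Tying the two examples back to the decision rule sketched earlier, a hypothetical call to that flag_preposition_error function; the probability values here are invented for illustration:

```python
# Reuses flag_preposition_error from the earlier sketch.
# Hypothetical probability distributions for the two example contexts:
fond_context = {"of": 0.87, "with": 0.04, "in": 0.03}
around_context = {"around": 0.38, "at": 0.33, "by": 0.10}

print(flag_preposition_error(fond_context, "with"))      # -> "of" (flag as error)
print(flag_preposition_error(around_context, "around"))  # -> None (flag as OK)
```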

Results

Model                   | Precision | Recall
Lexical                 | 80%       | 12%
+Combo:tag              | 82%       | 14%
+Combo:tag + Extraneous | 84%       | 19%

Google Features
- Adding Google features had minimal impact
- Using solely Google features (or counts) as a classifier: ~45% accuracy on native text
- Disclaimer: a very naïve implementation

Conclusions
We presented a combined ML and rule-based approach:
- State-of-the-art preposition selection performance: 79% accuracy
- Accurately detects preposition errors in ESL essays with P = 0.84, R = 0.19
In instructional applications it is important to minimize false positives, so precision is favored over recall.
This work is included in ETS's Criterion(SM) Online Writing Service and E-Rater.
Also see: "Native Judgments of Non-Native Usage" [HJCL '08] (tomorrow afternoon)

Common Preposition Confusions

Writer's Prep | Rater's Prep | Frequency
to            | null         | 9.5%
of            | null         | 7.3%
in            | at           | 7.1%
to            | for          | 4.6%
in            | null         | 3.2%
of            | for          | 3.1%
in            | on           | 3.1%
