
1 A Comparative Evaluation of Deep and Shallow Approaches to the Automatic Detection of Common Grammatical Errors Joachim Wagner, Jennifer Foster, and Josef van Genabith EMNLP-CoNLL 28th June 2007 National Centre for Language Technology School of Computing, Dublin City University

2 Talk Outline Motivation Background Artificial Error Corpus Evaluation Procedure Error Detection Methods Results and Analysis Conclusion and Future Work

3 Why Judge Grammaticality? Grammar checking Computer-assisted language learning –Feedback –Writing aid –Automatic essay grading Re-rank computer-generated output –Machine translation

4 Why this Evaluation? No agreed standard Differences in –What is evaluated –Corpora –Error density –Error types

5 Talk Outline Motivation Background Artificial Error Corpus Evaluation Procedure Error Detection Methods Results and Analysis Conclusion and Future Work

6 Deep Approaches Precision grammar Aim to distinguish grammatical sentences from ungrammatical sentences Grammar engineers –Avoid overgeneration –Increase coverage For English: –ParGram / XLE (LFG) –English Resource Grammar / LKB (HPSG)

7 Shallow Approaches Real-word spelling errors –vs grammar errors in general Part-of-speech (POS) n-grams –Raw frequency –Machine learning-based classifier –Features of local context –Noisy channel model –N-gram similarity, POS tag set

8 Talk Outline Motivation Background Artificial Error Corpus Evaluation Procedure Error Detection Methods Results and Analysis Conclusion and Future Work

9 Common Grammatical Errors 20,000 word corpus Ungrammatical English sentences –Newspapers, academic papers, emails, … Correction operators –Substitute (48 %) –Insert (24 %) –Delete (17 %) –Combination (11 %)

10 Common Grammatical Errors 20,000 word corpus Ungrammatical English sentences –Newspapers, academic papers, emails, … Correction operators –Substitute (48 %) –Insert (24 %) –Delete (17 %) –Combination (11 %) Agreement errors Real-word spelling errors

11 Chosen Error Types Agreement: She steered Melissa around a corners. Real-word: She could no comprehend. Extra word: Was that in the summer in? Missing word: What the subject?

12 Automatic Error Creation Agreement: replace determiner, noun or verb Real-word: replace according to pre-compiled list Extra word: duplicate token or part-of-speech, or insert a random token Missing word: delete token (likelihood based on part-of-speech)
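The slide above summarises the four error-creation operators; below is a minimal Python sketch of what such operators could look like. All names, the confusion list and the crude plural toggle are illustrative assumptions, not the authors' actual implementation (which, for example, weights deletions by part of speech).

```python
import random

# Hypothetical confusion list for real-word spelling errors (e.g. not -> no).
REAL_WORD_CONFUSIONS = {"not": ["no"], "them": ["then"], "an": ["and"]}

def agreement_error(tokens):
    """Replace a determiner, noun or verb; crudely approximated here by toggling a plural 's'."""
    i = random.randrange(len(tokens))
    tokens[i] = tokens[i][:-1] if tokens[i].endswith("s") else tokens[i] + "s"
    return tokens

def real_word_error(tokens):
    """Replace a token according to the pre-compiled confusion list."""
    candidates = [i for i, t in enumerate(tokens) if t.lower() in REAL_WORD_CONFUSIONS]
    if candidates:
        i = random.choice(candidates)
        tokens[i] = random.choice(REAL_WORD_CONFUSIONS[tokens[i].lower()])
    return tokens

def extra_word_error(tokens):
    """Insert an extra word: duplicate an existing token at a random position."""
    tokens.insert(random.randrange(len(tokens)), random.choice(tokens))
    return tokens

def missing_word_error(tokens):
    """Delete a token (the original procedure weights the choice by part of speech)."""
    del tokens[random.randrange(len(tokens))]
    return tokens

print(real_word_error("She could not comprehend".split()))  # e.g. ['She', 'could', 'no', 'comprehend']
```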

13 Talk Outline Motivation Background Artificial Error Corpus Evaluation Procedure Error Detection Methods Results and Analysis Conclusion and Future Work

14 BNC Test Data (1) BNC: 6.4 M sentences, reduced to 4.2 M sentences (no speech, poems, captions and list items); randomisation into 10 sets with 420 K sentences each

15 BNC Test Data (2) Error creation applied to each set, producing an error corpus for each error type: agreement, real-word, extra word, missing word

16 BNC Test Data (3) Mixed error type: ¼ each of the four error types

17 BNC Test Data (4) 50 sets; 5 error types: agreement, real-word, extra word, missing word, mixed errors; each 50:50 ungrammatical:grammatical

18 BNC Test Data (5) Training data (if required by method) and test data; example: 1st cross-validation run for agreement errors

19 Evaluation Measures Precision = tp / (tp + fp) Recall = tp / (tp + fn) F-score = 2 * pr * re / (pr + re) Accuracy = (tp + tn) / total tp := ungrammatical sentences identified as such (tp = true positive, tn = true negative, fp = false positive, fn = false negative)
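For concreteness, a direct Python rendering of these four measures, where a positive is an ungrammatical sentence flagged as such (no guard against zero denominators, for brevity):

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(tp, fp, fn):
    pr, re = precision(tp, fp), recall(tp, fn)
    return 2 * pr * re / (pr + re)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)
```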

20 Talk Outline Motivation Background Artificial Error Corpus Evaluation Procedure Error Detection Methods Results and Analysis Conclusion and Future Work

21 Overview of Methods M1–M5: XLE output feeds M1 and M3, POS n-gram information feeds M2 and M4; M1 and M2 are the basic methods, M3–M5 are the decision tree methods (M5 uses both feature sets)

22 Method 1: Precision Grammar XLE English LFG Fragment rule –Parses ungrammatical input –Marked with * Zero number of parses Parser exceptions (time-out, memory) M1
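A minimal sketch of the Method 1 decision rule as described on this slide. The result record and its field names are assumptions for illustration; the real XLE output format differs.

```python
from collections import namedtuple

# Hypothetical summary of one XLE parse result.
XLEResult = namedtuple("XLEResult", ["num_parses", "starred", "exception"])

def m1_is_ungrammatical(r):
    # Flag a sentence if XLE produced no parse, could only parse it via the
    # starred fragment rule, or raised a parser exception (time-out, out of memory).
    return r.num_parses == 0 or r.starred or r.exception is not None

print(m1_is_ungrammatical(XLEResult(num_parses=3, starred=True, exception=None)))  # True
```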

23 XLE Parsing The first 60 K sentences of each of the 50 sets are parsed with XLE: 50 x 60 K = 3 M parse results (M1)

24 Method 2: POS N-grams Flag rare POS n-grams as errors Rare: according to reference corpus Parameters: n and frequency threshold –Tested n = 2, …, 7 on held-out data –Best: n = 5 and frequency threshold 4 M2
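A sketch of this flagging rule under the best setting reported here (n = 5, frequency threshold 4). The reference table is assumed to map POS n-grams from a large corpus to raw frequencies; whether the original method padded sentence boundaries is also an assumption.

```python
def m2_is_ungrammatical(pos_tags, reference_counts, n=5, threshold=4):
    """Flag a sentence if its rarest POS n-gram is rare in the reference corpus."""
    # Pad so n-grams overlapping the sentence boundaries are also checked (assumed).
    padded = ["<s>"] * (n - 1) + pos_tags + ["</s>"] * (n - 1)
    ngrams = [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]
    rarest = min(reference_counts.get(g, 0) for g in ngrams)
    return rarest < threshold
```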

25 POS N-gram Information For each sentence, the frequency of its rarest n-gram is looked up in a reference n-gram table (built from 9 sets), giving 3 M frequency values; repeated for n = 2, 3, …, 7 (M2)

26 Method 3: Decision Trees on XLE Output Output statistics –Starredness (0 or 1) and parser exceptions (-1 = time-out, -2 = exceeded memory, …) –Number of optimal parses –Number of unoptimal parses –Duration of parsing –Number of subtrees –Number of words M3
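A sketch of how these output statistics could be assembled into a feature vector for the decision tree learner; the field names and the exception encoding are illustrative assumptions.

```python
def m3_features(r):
    # Starredness doubles as the exception flag: 0 = ordinary parse, 1 = starred
    # (fragment) parse, negative values = parser exceptions (-1 time-out,
    # -2 exceeded memory), which is why the example tree below splits on "Star? < 0".
    return [
        r["star"],
        r["optimal_parses"],
        r["unoptimal_parses"],
        r["parse_seconds"],
        r["subtrees"],
        r["words"],
    ]
```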

27 Decision Tree Example (M3) Star? < 0 → U; else Star? >= 1 → U; else Optimal? < 5 → U, >= 5 → G (U = ungrammatical, G = grammatical)

28 Method 4: Decision Trees on N-grams Frequency of rarest n-gram in sentence for n = 2, …, 7 –feature vector: 6 numbers M4
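A sketch of the Method 4 feature representation: the frequency of the rarest POS n-gram for each n from 2 to 7 gives a six-number vector per sentence. Using scikit-learn's DecisionTreeClassifier is a substitution for illustration; the slides do not name a particular decision tree implementation.

```python
from sklearn.tree import DecisionTreeClassifier

def m4_features(pos_tags, reference_counts_by_n):
    """Return the frequency of the rarest POS n-gram for each n from 2 to 7."""
    feats = []
    for n in range(2, 8):
        counts = reference_counts_by_n[n]
        ngrams = [tuple(pos_tags[i:i + n]) for i in range(len(pos_tags) - n + 1)]
        feats.append(min((counts.get(g, 0) for g in ngrams), default=0))
    return feats

# X: one six-number vector per sentence, y: 1 = ungrammatical, 0 = grammatical
# clf = DecisionTreeClassifier().fit(X, y)
```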

29 Decision Tree Example (M4) Tree splits on the frequencies of the rarest 5-gram and 7-gram (thresholds 4, 1 and 45); U = ungrammatical, G = grammatical

30 Method 5: Decision Trees on Combined Feature Sets Example tree: Star? < 0 → U; else Star? >= 1 → U; else 5-gram? < 4 → U, >= 4 → G (M5)

31 Talk Outline Motivation Background Artificial Error Corpus Evaluation Procedure Error Detection Methods Results and Analysis Conclusion and Future Work

32 Strengths of each Method (F-score chart)

33 Comparison of Methods (F-score chart)

34 Results: F-Score

35 Talk Outline Motivation Background Artificial Error Corpus Evaluation Procedure Error Detection Methods Results and Analysis Conclusion and Future Work

36 Conclusions Basic methods surprisingly close to each other Decision tree effective with deep approach Combined approach best on all but one error type

37 Future Work Error types: –Word order –Multiple errors per sentence Add more features Other languages Test on MT output Establish upper bound

38 Thank You! Djamé Seddah (La Sorbonne University) National Centre for Language Technology School of Computing, Dublin City University

39 Extra Slides P/R/F/A graphs More on why judge grammaticality Precision Grammars in CALL Error creation examples Variance in cross-validation runs Precision over recall graphs (M3) More future work

40 Results: Precision

41 Results: Recall

42 Results: F-Score

43 Results: Accuracy

44 Results: Precision

45 Results: Recall

46 Results: F-Score

47 Results: Accuracy

48 Why Judge Grammaticality? (2) Automatic essay grading Trigger deep error analysis –Increase speed –Reduce overflagging Most approaches easily extend to –Locating errors –Classifying errors

49 Precision Grammars in CALL Focus: –Locate and categorise errors Approaches: –Extend existing grammars –Write new grammars

50 Grammar Checker Research Focus of grammar checker research –Locate errors –Categorise errors –Propose corrections –Other feedback (CALL)

51 N-gram Methods Flag unlikely or rare sequences –POS (different tagsets) –Tokens –Raw frequency vs. mutual information Most publications are in the area of context-sensitive spelling correction –Real word errors –Resulting sentence can be grammatical

52 Test Corpus - Example Missing Word Error She didn’t to face him She didn’t want to face him

53 Test Corpus – Example 2 Context-sensitive spelling error I love then both I love them both

54 Cross-validation Standard deviation below Except Method 4: High number of test items Report average percentage

55 Example Method 1 – Agreement errors: 65.4 % average F-score over the cross-validation runs; standard deviation of the per-run F-scores: 0.001

56 POS n-grams and Agreement Errors n = 2, 3, 4, 5 Best F-Score 66 % Best Accuracy 55 % XLE parser F-Score 65 %

57 POS n-grams and Context-Sensitive Spelling Errors n = 2, 3, 4, 5 Best F-Score 69 % XLE 60 % Best Accuracy 66 %

58 POS n-grams and Extra Word Errors n = 2, 3, 4, 5 Best F-Score 70 % XLE 62 % Best Accuracy 68 %

59 POS n-grams and Missing Word Errors n = 2, 3, 4, 5 Best F-Score 67 % XLE 53 % Best Accuracy 59 %