Automating rule generation for grammar checkers


Automating rule generation for grammar checkers Marcin Miłkowski, IFiS PAN

Presentation Outline
- Why we need automatically generated rules
- Three approaches: a priori, mixed, fully corpus-based
- Some examples

Motivation
- Writing grammar-checker rules requires both linguistic knowledge and programming skills
- People with both are hard to find, especially for open-source projects such as LanguageTool

Three approaches
- A priori: formalizing usage dictionaries, spelling dictionaries, formalized grammars, prescriptions...
- Mixed: training rules on some of the above plus clean corpora
- Fully corpus-based...

A priori approach
Pros:
- Dictionary-based knowledge tends to be reliable
Cons:
- Still expensive
- Dictionaries tend to be outdated
- Minority languages can lack explicit prescriptive dictionaries

A priori approach – cons
Low recall:
- Some prescriptions are outdated – for example, Polish usage dictionaries still warn against bureaucratic newspeak imported from Russian
- New kinds of errors can be overlooked – for example, compounds typed as separate words slip past incomplete spelling checkers

A priori approach – cons
Low precision:
- Dictionaries skip most of the relevant context information altogether
- Rules are presented as universal, with no information about exceptions
- This results in false alarms for the end user...

Mixed approach
- Create rules by training on corpora seeded with dictionary-based or automatically induced errors
- Many algorithms can be used
- TBL (transformation-based learning) is a good fit, as it works well even with smaller corpora
- But phrase-based SMT, for example, could be used as well

TBL rules
- The TBL algorithm searches for transformation rules instantiating a predefined set of templates
- TBL rules can be translated straightforwardly into LanguageTool's declarative pattern rules (a sketch of the translation follows)
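A minimal sketch of such a translation, assuming a toy in-memory representation of a learned confusion rule. The emitted XML follows the general shape of LanguageTool's grammar.xml pattern rules, but the rule ID, message text, and class names are invented for illustration, and the learned context window word:[1,3]=PRP is simplified to the immediate right neighbor:

```java
// Sketch: turn a learned confusion rule ("end" -> "and" when followed by
// a PRP tag) into a LanguageTool-style declarative pattern rule. The
// TblRule fields mirror fnTBL output; all names are illustrative.
public class TblToLanguageTool {

    record TblRule(String wrongWord, String nextPostag, String suggestion) {}

    static String toPatternRule(TblRule r, String id) {
        return String.join("\n",
            "<rule id=\"" + id + "\" name=\"" + r.wrongWord() + " vs. " + r.suggestion() + "\">",
            "  <pattern>",
            "    <marker><token>" + r.wrongWord() + "</token></marker>",
            "    <token postag=\"" + r.nextPostag() + "\"/>",
            "  </pattern>",
            "  <message>Did you mean <suggestion>" + r.suggestion() + "</suggestion>?</message>",
            "</rule>");
    }

    public static void main(String[] args) {
        // Corresponds to the learned rule: pos_0=end word:[1,3]=PRP => pos=and
        System.out.println(toPatternRule(new TblRule("end", "PRP", "and"), "END_AND"));
    }
}
```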

TBL learning
- TBL is usually used to learn transformation rules that correct POS-tagging errors (the Brill tagger)
- But many other classification-based tasks are similar
- Brill's idea: substitute commonly confused words for POS tags, and the learned rules will correct errors in the text (see the sketch below)
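To make the learning step concrete, here is a sketch of the greedy scoring at the heart of TBL: a candidate rule earns GOOD for each wrong label it fixes and BAD for each correct label it breaks, and the learner repeatedly keeps the highest-scoring rule. The types and the toy rule are stand-ins, not fnTBL internals:

```java
import java.util.*;

// Sketch of TBL's greedy rule scoring: SCORE = GOOD - BAD, where GOOD
// counts errors a rule fixes and BAD counts errors it introduces.
public class TblLoop {

    interface Rule { boolean matches(List<Token> sent, int i); String newLabel(); }
    record Token(String word, String truth) {}

    static int score(Rule r, List<List<Token>> corpus, List<List<String>> guesses) {
        int good = 0, bad = 0;
        for (int s = 0; s < corpus.size(); s++) {
            List<Token> sent = corpus.get(s);
            List<String> guess = guesses.get(s);
            for (int i = 0; i < sent.size(); i++) {
                if (!r.matches(sent, i)) continue;
                boolean wasRight = guess.get(i).equals(sent.get(i).truth());
                boolean isRight = r.newLabel().equals(sent.get(i).truth());
                if (!wasRight && isRight) good++;      // an error fixed
                else if (wasRight && !isRight) bad++;  // an error introduced
            }
        }
        return good - bad;
    }

    public static void main(String[] args) {
        // One garbled sentence: the observed "end" should read "and".
        List<List<Token>> corpus = List.of(
            List.of(new Token("end", "and"), new Token("they", "they")));
        List<List<String>> guesses = List.of(List.of("end", "they"));
        Rule r = new Rule() {  // "replace 'end' by 'and' before 'they'"
            public boolean matches(List<Token> s, int i) {
                return s.get(i).word().equals("end")
                        && i + 1 < s.size() && s.get(i + 1).word().equals("they");
            }
            public String newLabel() { return "and"; }
        };
        System.out.println("SCORE = " + score(r, corpus, guesses)); // SCORE = 1
    }
}
```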

Original confusion set idea
- Instead of confusable POS tags, use sets of confusable words: {there, their}, {it's, its}, {advise, advice}...
- Take a clean corpus and replace correct words with other members of their confusion set
- Run TBL learning on the garbled corpus (a sketch follows)
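A sketch of the garbling step under these assumptions: a hand-picked confusion map and a fixed 50% error rate stand in for whatever sets and sampling a real setup would use:

```java
import java.util.*;

// Sketch: garble a clean, tokenized corpus by swapping words that belong
// to a confusion set for another member of that set. The garbled token is
// what the learner observes; the original is kept as the truth label that
// the learned rules should restore.
public class CorpusGarbler {

    static final Map<String, List<String>> CONFUSION = Map.of(
        "there", List.of("their"), "their", List.of("there"),
        "advise", List.of("advice"), "advice", List.of("advise"));

    static final Random RNG = new Random(42);

    /** Returns (observed, truth) pairs for one sentence. */
    static List<String[]> garble(List<String> sentence) {
        List<String[]> out = new ArrayList<>();
        for (String word : sentence) {
            List<String> alts = CONFUSION.get(word.toLowerCase());
            String observed = (alts != null && RNG.nextDouble() < 0.5)
                    ? alts.get(RNG.nextInt(alts.size()))  // inject an error
                    : word;                               // leave untouched
            out.add(new String[] { observed, word });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] p : garble(List.of("I", "value", "their", "advice")))
            System.out.println(p[0] + " / " + p[1]);
    }
}
```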

TBL learning
- In the original Brill papers, the members of confusion sets were taken from a dictionary (the mixed approach)
- But confusion sets can also be created automatically:
- Without a corpus
- Or with a corpus

Automating confusion sets...
- You can simply garble a clean corpus by replacing correct tokens with garbled ones
- But you should garble them in an interesting way...

Automatic confusion sets...
- Not all artificial errors are relevant: most will result in 0% recall on normal text
- This was the case with autocorrection rules for Polish in early editions of OpenOffice.org and MS Word
- So you need to constrain the search space of errors

Searching for errors
You can simulate mechanical errors:
- Simulate spelling errors invisible to spelling checkers by replacing characters that are likely to be confused under a given input method
- Based on keyboard layout for typed text
- Based on character shape for OCR... (see the sketch below)
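A sketch of the keyboard-layout case: generate single-character substitutions between adjacent keys and keep only those that land on another dictionary word, i.e. errors a spelling checker cannot flag. The adjacency map and dictionary are tiny toy stand-ins:

```java
import java.util.*;

// Sketch: simulate keyboard typos that a spelling checker cannot see.
// For each character, try the keys adjacent to it on a QWERTY layout and
// keep only substitutions that produce another valid dictionary word.
public class MechanicalErrors {

    static final Map<Character, String> ADJACENT = Map.of(
        'a', "qwsz", 'e', "wrds", 't', "ryfg", 'o', "ipkl");

    static Set<String> invisibleTypos(String word, Set<String> dictionary) {
        Set<String> hits = new LinkedHashSet<>();
        for (int i = 0; i < word.length(); i++) {
            String near = ADJACENT.getOrDefault(word.charAt(i), "");
            for (char c : near.toCharArray()) {
                String typo = word.substring(0, i) + c + word.substring(i + 1);
                if (dictionary.contains(typo)) hits.add(typo); // spellchecker-proof
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("cat", "car", "can");
        System.out.println(invisibleTypos("cat", dict)); // [car]: 't' sits next to 'r'
    }
}
```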

Searching for errors
But it is much harder to simulate cognitive errors such as:
- Confused nouns
- Non-standard versions of idioms
- False friends in translated text...

So let's use corpora directly
- Error corpora are a scarce resource, but you can build them from the revision history of Wikipedia (a mining sketch follows)
- Frequent minor edits tend to relate to linguistic mistakes and terminology conventions
- For some uses, an automatically generated rule for imposing a convention may even be desirable
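A crude sketch of mining such edits, assuming revisions are already paired and sentence-aligned; it records a candidate confusion pair whenever two revisions differ in exactly one token:

```java
import java.util.*;

// Sketch: mine confusion pairs from revision pairs. When an old and a new
// revision have the same length and differ in exactly one token, record
// that token pair as a candidate (error, correction). Real revision dumps
// need proper alignment; insertions and deletions are skipped here.
public class RevisionMiner {

    static Optional<String[]> singleTokenEdit(String before, String after) {
        String[] a = before.split("\\s+"), b = after.split("\\s+");
        if (a.length != b.length) return Optional.empty(); // skip ins/del here
        String[] pair = null;
        for (int i = 0; i < a.length; i++) {
            if (a[i].equals(b[i])) continue;
            if (pair != null) return Optional.empty();     // more than one edit
            pair = new String[] { a[i], b[i] };
        }
        return Optional.ofNullable(pair);
    }

    public static void main(String[] args) {
        singleTokenEdit("they went to end fro", "they went to and fro")
            .ifPresent(p -> System.out.println(p[0] + " -> " + p[1]));
    }
}
```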

Using error corpora
- It might seem that you can simply train on the error corpus (or the revisions) directly
- But such a corpus tends to yield high recall with low precision: there are not enough examples of the members of the same confusion set to discriminate the cases

Example
- See the rules generated with the fnTBL toolkit (feature names are kept for those familiar with the Brill tagger)
- This sample is based on training on the Holbrook corpus of spelling mistakes

Rules from Holbrook corpus
GOOD:21 BAD:0 SCORE:21 RULE: pos_0=Jame word_0=Jame => pos=James
GOOD:10 BAD:0 SCORE:10 RULE: pos_0=two pos_1=the word_0=two word_1=the => pos=to
GOOD:5 BAD:0 SCORE:5 RULE: pos_0=cave word_0=cave => pos=gave

Why is that?
- TBL tends to find simple unconditional rules a → b, so that any "Jame" is replaced by "James" and any "cave" by "gave"
- Such rules may be right, but only if the corpus is large enough...

Solution
- You can, however, extract the confusion pairs and apply them, as before, to a clean but larger corpus
- This gives you more examples; even larger error corpora provide too few

Rules based on the new method
GOOD:4845 BAD:0 SCORE:4845 RULE: pos_0=end word:[-3,-1]=, => pos=and
GOOD:1443 BAD:0 SCORE:1443 RULE: pos_0=end word:[1,3]=PRP => pos=and
GOOD:839 BAD:0 SCORE:839 RULE: pos_0=end pos:[1,3]=. => pos=and

Comparison 1 – Holbrook corpus
Naïve method:
- Training corpus: precision 95.11%, recall 20.34%
- Brown corpus: precision 0%, recall 0%
Mixed method:
- Training corpus: precision 99.97%, recall 99.43%
- Brown corpus: precision 38%, recall 100%

Comparison 2 – PL Wikipedia revision corpus
Naïve method:
- Training corpus: precision 80.14%, recall 33.60%
- Frequency dictionary corpus: precision 0.34%, recall 100%
Mixed method:
- Training corpus: precision 100%, recall 99.99%
- Frequency dictionary corpus: precision 34.65%, recall 100%

Caveat
- Insertion and deletion require slightly different handling than replacement (i.e., different rule templates for TBL learning and a different substitution procedure)
- Stopwords in confusion sets will blow up the training corpus if not throttled (see the sketch below)
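One simple way to throttle, sketched under the assumption that errors are injected pair by pair: cap the number of injected errors per confusion pair. The cap value and class names are illustrative, not taken from the talk:

```java
import java.util.*;

// Sketch: cap how many errors any single confusion pair may inject, so
// that high-frequency pairs (typically stopwords) do not flood the
// garbled training corpus.
public class PairThrottle {

    private final int maxPerPair;
    private final Map<String, Integer> used = new HashMap<>();

    PairThrottle(int maxPerPair) { this.maxPerPair = maxPerPair; }

    /** True if we may still inject an error for this (wrong, right) pair. */
    boolean allow(String wrong, String right) {
        String key = wrong + "\u0000" + right;
        int n = used.getOrDefault(key, 0);
        if (n >= maxPerPair) return false;
        used.put(key, n + 1);
        return true;
    }
}
```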

Some examples
- Some rules for Polish and English were created this way in LanguageTool
- We were able to find more instances of errors (higher recall, in some cases by 100%) and to reduce false alarms (higher precision)

English – muTBL rules
wd:they>the <- tag:'DT'@[5] o
wd:they>the <- tag:'NN'@[-2] o
wd:they>the <- tag:'JJ,NN,RB'@[1] o
wd:they>the <- tag:'NNS,VBZ'@[5] o
wd:there>their <- wd:of@[-1] o
wd:they>the <- tag:'JJ,NN'@[1] o

Future work
- Port the current framework (mostly AWK scripts and Unix tools) to Java, including bootstrapping the error corpus from the Wikipedia history dump
- Provide automatic translation from TBL rules to the LT formalism

THANK YOU