1 Automating rule generation for grammar checkers
Marcin Miłkowski, IFiS PAN

2 Presentation Outline
Why we need automatically generated rules
Three approaches:
A priori
Mixed
Fully corpora-based
Some examples

3 Motivation
Writing grammar checker rules requires both linguistic knowledge and programming skills
Such a combination is hard to find, especially for open source projects such as LanguageTool

4 Three approaches
A priori: formalizing usage dictionaries, spelling dictionaries, formalized grammars, prescriptions...
Mixed: training rules based on some of the above and clean corpora
Fully corpora-based...

5 A priori approach
Pros:
Dictionary-based knowledge tends to be reliable
Cons:
Still expensive
Dictionaries tend to be outdated
Minority languages can lack explicit prescriptive dictionaries

6 A priori approach - cons
Low recall:
Some prescriptions are outdated – for example, Polish usage dictionaries warn against bureaucratic newspeak imported from Russian
New kinds of errors can be overlooked – for example, words typed as two separate tokens that incomplete spelling checkers fail to flag

7 A priori approach - cons
Low precision:
In dictionaries, most relevant context information is skipped altogether
No information about exceptions to rules presented as universal
This results in false alarms for the end user...

8 Mixed approach
Creating rules by training them on corpora and some dictionary-based or automatically induced errors
Many algorithms can be used
TBL (transformation-based learning) is one of the nicer ones, as it works well with smaller corpora
But phrase-based SMT, for example, could be used as well

9 TBL rules
The TBL algorithm tries to find transformation rules from a predefined set of templates
TBL rules can be straightforwardly translated into LanguageTool declarative pattern rules
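
As a rough illustration of that translation step, here is a minimal Java sketch that turns an unconditional word-replacement rule (like the Holbrook-derived rules shown later) into a simplified LanguageTool-style XML pattern rule. The XML layout is abbreviated (real LT rules also carry categories, richer messages and example sentences), and the class and method names are invented for this example.

// Minimal sketch: turning a word-replacement TBL rule into a simplified
// LanguageTool-style XML pattern rule. The XML layout is illustrative.
public class TblToLtRule {

    static String toPatternRule(String wrongWord, String correctWord) {
        return String.join("\n",
            "<rule id=\"" + wrongWord.toUpperCase() + "_VS_"
                + correctWord.toUpperCase() + "\" name=\"" + wrongWord
                + " vs. " + correctWord + "\">",
            "  <pattern>",
            "    <token>" + wrongWord + "</token>",
            "  </pattern>",
            "  <message>Did you mean <suggestion>" + correctWord
                + "</suggestion>?</message>",
            "</rule>");
    }

    public static void main(String[] args) {
        // Corresponds to a TBL rule like: word_0=Jame => James
        System.out.println(toPatternRule("Jame", "James"));
    }
}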

10 TBL Learning
TBL is usually used to extract transformation rules that correct incorrect POS tagging (the Brill tagger)
But many other tasks are similar (based on classification)
Brill's idea: treat commonly confused words as if they were POS tags, and you will find rules that correct errors in the text
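
For readers unfamiliar with TBL, the following minimal Java sketch shows the greedy learning loop in its simplest form: score every candidate rule by how many errors it fixes minus how many correct tokens it breaks, keep the best one, apply it, and repeat. The single rule template ("rewrite A to B when the previous token is P") and the toy corpus are illustrative stand-ins for the much richer fnTBL template set used in the actual experiments.

import java.util.*;

// Minimal sketch of the greedy TBL loop for confusion-set correction.
public class TinyTbl {

    record Rule(String from, String to, String prevWord) {}

    // net score: errors fixed minus correct tokens broken
    static int score(Rule r, String[] current, String[] gold) {
        int s = 0;
        for (int i = 1; i < current.length; i++) {
            if (current[i].equals(r.from()) && current[i - 1].equals(r.prevWord())) {
                if (gold[i].equals(r.to())) s++;          // error fixed
                else if (gold[i].equals(current[i])) s--; // correct token broken
            }
        }
        return s;
    }

    static void apply(Rule r, String[] current) {
        for (int i = 1; i < current.length; i++)
            if (current[i].equals(r.from()) && current[i - 1].equals(r.prevWord()))
                current[i] = r.to();
    }

    public static void main(String[] args) {
        // "current" is the garbled corpus, "gold" the clean original.
        String[] current = "I went two the shop and gave it two my dog".split(" ");
        String[] gold    = "I went to the shop and gave it to my dog".split(" ");

        // candidate rules: one per observed mismatch, using the previous word as context
        List<Rule> candidates = new ArrayList<>();
        for (int i = 1; i < current.length; i++)
            if (!current[i].equals(gold[i]))
                candidates.add(new Rule(current[i], gold[i], current[i - 1]));

        while (true) {
            Rule best = null;
            int bestScore = 0;
            for (Rule r : candidates) {
                int s = score(r, current, gold);
                if (s > bestScore) { bestScore = s; best = r; }
            }
            if (best == null) break;   // no rule improves the corpus any further
            System.out.println(best + " score=" + bestScore);
            apply(best, current);
        }
    }
}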

11 Original confusion set idea
Instead of confused POS tags, make sets of confused words: {there, their}, {it's, its}, {advise, advice}...
Take a clean corpus and replace correct words with other members of the confusion set they belong to
Run TBL learning on the resulting corpus
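
A minimal Java sketch of that substitution step: walk a clean corpus and replace members of a confusion set with another member of the same set, keeping the clean text as the gold standard for training. The confusion sets here are illustrative; for brevity every occurrence is garbled, whereas in practice you would garble only a fraction so the learner also sees the words used correctly.

import java.util.*;

// Minimal sketch of confusion-set substitution over a clean corpus.
public class ConfusionGarbler {

    static final Map<String, List<String>> CONFUSION_SETS = Map.of(
        "there", List.of("their"),   "their", List.of("there"),
        "its", List.of("it's"),      "it's", List.of("its"),
        "advice", List.of("advise"), "advise", List.of("advice"));

    public static void main(String[] args) {
        Random random = new Random(42);
        String clean = "I value their advice because it's honest";
        List<String> garbled = new ArrayList<>();

        for (String token : clean.split(" ")) {
            List<String> alternatives = CONFUSION_SETS.get(token.toLowerCase());
            if (alternatives != null) {
                // pick some other member of the same confusion set
                token = alternatives.get(random.nextInt(alternatives.size()));
            }
            garbled.add(token);
        }
        System.out.println("clean:   " + clean);
        System.out.println("garbled: " + String.join(" ", garbled));
    }
}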

12 TBL Learning
In the original Brill papers, the members of confusion sets were taken from the dictionary (mixed approach)
The confusion sets can also be created automatically:
Without a corpus
Or with a corpus

13 Automating confusion sets...
You can simply garble a clean corpus by replacing correct tokens with garbled ones
But you should garble them in an interesting way...

14 Automatic confusion sets...
Not all artificial errors are relevant: most of them will result in 0% recall on normal text
This was the case with the autocorrection rules for Polish in early editions of OpenOffice.org and MS Word
So you need to constrain the search space of errors

15 Searching for errors
You can simulate mechanical errors:
Simulate spelling errors invisible to spelling checkers by replacing characters that are likely to be mistaken for one another with a given input method
Based on keyboard layout for typed text
Based on character shape for OCR...
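
A possible Java sketch of the keyboard-layout case: replace a character with one of its neighbours on a QWERTY keyboard to produce candidate typos. The adjacency table is a small illustrative excerpt; in the scenario above you would additionally keep only those typos that the spelling checker accepts, since those are the ones a grammar checker has to catch.

import java.util.*;

// Minimal sketch of simulating "mechanical" typos via keyboard adjacency.
public class KeyboardTypos {

    // small illustrative excerpt of a QWERTY neighbourhood table
    static final Map<Character, String> NEIGHBOURS = Map.of(
        'a', "qwsz",
        'e', "wrds",
        'o', "ipkl",
        'n', "bhjm",
        't', "rygf");

    public static void main(String[] args) {
        Random random = new Random(7);
        String word = "note";
        // produce a few candidate typos for the same word
        for (int k = 0; k < 3; k++) {
            char[] chars = word.toCharArray();
            int i = random.nextInt(chars.length);
            String near = NEIGHBOURS.get(Character.toLowerCase(chars[i]));
            if (near != null) {
                chars[i] = near.charAt(random.nextInt(near.length()));
            }
            System.out.println(word + " -> " + new String(chars));
        }
    }
}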

16 Searching for errors
But it's much harder to simulate cognitive errors such as:
Confused nouns
Non-standard versions of idioms
False friends in translated text...

17 So let's use corpora directly
Error corpora are a scarce resource, but you can obtain them from the revision history of Wikipedia
Frequent minor edits tend to be related to linguistic mistakes and terminology conventions
For some uses, an automatically generated rule for imposing a convention could be desirable
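
One simple way to harvest candidate confusion pairs from such revisions is sketched below in Java: compare the old and new version of a sentence and keep single-token substitutions. Processing a real Wikipedia history dump would additionally require XML parsing, sentence alignment and heavy filtering of vandalism and content edits; the revision pair here is invented for illustration.

import java.util.*;

// Minimal sketch of mining candidate confusion pairs from revision pairs.
public class RevisionMiner {

    // returns {before, after} if the two texts differ in exactly one token
    static Optional<String[]> singleTokenEdit(String before, String after) {
        String[] a = before.split("\\s+");
        String[] b = after.split("\\s+");
        if (a.length != b.length) return Optional.empty(); // skip insertions/deletions
        String[] edit = null;
        for (int i = 0; i < a.length; i++) {
            if (!a[i].equals(b[i])) {
                if (edit != null) return Optional.empty();  // more than one change
                edit = new String[] { a[i], b[i] };
            }
        }
        return Optional.ofNullable(edit);
    }

    public static void main(String[] args) {
        String oldRev = "The affect of the change was small";
        String newRev = "The effect of the change was small";
        singleTokenEdit(oldRev, newRev)
            .ifPresent(e -> System.out.println("confusion pair: {" + e[0] + ", " + e[1] + "}"));
    }
}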

18 Using error corpora
It might seem that you can simply use the corpus with errors (or revisions) directly
But such a corpus tends to give high recall with low precision: there are not enough examples for members of the same confusion set to discriminate the cases

19 Example
See the rules generated with the fnTBL toolkit (feature names are kept for those who are familiar with the Brill tagger)
This sample is based on training on the Holbrook corpus of spelling mistakes

20 Rules from Holbrook corpus
GOOD:21 BAD:0 SCORE:21 RULE: pos_0=Jame word_0=Jame => pos=James
GOOD:10 BAD:0 SCORE:10 RULE: pos_0=two pos_1=the word_0=two word_1=the => pos=to
GOOD:5 BAD:0 SCORE:5 RULE: pos_0=cave word_0=cave => pos=gave

21 Why is that?
TBL tends to find simple unconditional rules: a → b, so that any “Jame” will be replaced by “James”, and “cave” by “gave”
Such rules may be right, but only if the corpus is large enough...

22 Solution
You can, however, extract the confusion sets and apply them as before to a clean but larger corpus
You will have more examples – examples are scarce even in larger error corpora

23 Rules based on new method
GOOD:4845 BAD:0 SCORE:4845 RULE: pos_0=end word:[-3,-1]=, => pos=and
GOOD:1443 BAD:0 SCORE:1443 RULE: pos_0=end word:[1,3]=PRP => pos=and
GOOD:839 BAD:0 SCORE:839 RULE: pos_0=end pos:[1,3]=. => pos=and

24 Comparison 1 – Holbrook corpus
Naïve method:
Training corpus – Recall: 20.34%, Precision: 95.11%
Brown corpus – Recall: 0%, Precision: 0%
Mixed method:
Training corpus – Recall: 99.43%, Precision: 99.97%
Brown corpus – Recall: 100%, Precision: 38%

25 Comparison 2 – PL Wikipedia revision corpus
Naïve method:
Training corpus – Precision: 80.14%, Recall: 33.60%
Frequency dictionary corpus – Precision: 0.34%, Recall: 100%
Mixed method:
Training corpus – Precision: 100%, Recall: 99.99%
Frequency dictionary corpus – Precision: 34.65%, Recall: 100%
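
As a reminder of how the figures in these two comparisons are computed, here is a tiny Java sketch: precision is the share of flagged tokens that were real errors, recall the share of real errors that were flagged. The counts are chosen only to reproduce the 38% precision / 100% recall reported for the mixed method on the Brown corpus in the first comparison; they are not the actual evaluation counts.

// Minimal sketch of the evaluation arithmetic behind the tables above.
public class PrecisionRecall {
    public static void main(String[] args) {
        int truePositives  = 38;   // flagged tokens that were real errors
        int falsePositives = 62;   // flagged tokens that were actually fine
        int falseNegatives = 0;    // real errors the rules missed

        double precision = 100.0 * truePositives / (truePositives + falsePositives);
        double recall    = 100.0 * truePositives / (truePositives + falseNegatives);
        System.out.printf("Precision: %.2f%%  Recall: %.2f%%%n", precision, recall);
    }
}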

26 Caveat
Insertion and deletion require slightly different handling than replacement (i.e., different rule templates for TBL learning and a different substitution step)
Stopwords in confusion sets will create overly large training corpora if not throttled (see the sketch below)
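
The throttling mentioned above could look roughly like this Java sketch: cap how many occurrences of a very frequent confusion-set member (here the stopword "the") get garbled, so the artificial error corpus stays manageable. The cap of 1000 is an arbitrary illustrative value.

import java.util.*;

// Minimal sketch of throttling substitutions for frequent stopwords.
public class StopwordThrottle {
    public static void main(String[] args) {
        int cap = 1000;
        Map<String, Integer> used = new HashMap<>();
        List<String> corpus = Collections.nCopies(5000, "the"); // toy corpus
        List<String> garbled = new ArrayList<>();

        for (String token : corpus) {
            int soFar = used.getOrDefault(token, 0);
            if (token.equals("the") && soFar < cap) {
                garbled.add("they");               // inject the confusion
                used.put(token, soFar + 1);
            } else {
                garbled.add(token);                // leave the rest untouched
            }
        }
        long injected = garbled.stream().filter("they"::equals).count();
        System.out.println("injected errors: " + injected); // 1000, not 5000
    }
}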

27 Some examples
Some rules for Polish and English were created this way in LanguageTool
We were able to find more instances of errors (higher recall, in some cases by 100%) and to reduce the number of false alarms (higher precision)

28 English - muTBL rules
wd:they>the <- tag:'DT'@[5] o
wd:they>the <- o
wd:they>the <- o
wd:they>the <- o
wd:there>their <- o
wd:they>the <- o
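
Reading the first rule as "rewrite the word they to the when the POS tag five positions to the right is DT", a minimal Java sketch of applying such a muTBL-style rule might look as follows. The token and tag sequences are invented, and the reading of the @[5] offset is an assumption about the muTBL notation.

// Minimal sketch of applying one muTBL-style rule of the form
//   wd:they>the <- tag:'DT'@[5]
public class MuTblApply {

    static void applyRule(String[] words, String[] tags,
                          String from, String to, String tag, int offset) {
        for (int i = 0; i < words.length; i++) {
            int j = i + offset;
            if (words[i].equals(from)
                    && j >= 0 && j < tags.length && tags[j].equals(tag)) {
                words[i] = to;   // condition met: rewrite the word
            }
        }
    }

    public static void main(String[] args) {
        String[] words = { "they", "quick", "brown", "fox", "saw", "the", "dog" };
        String[] tags  = { "PRP",  "JJ",    "JJ",    "NN",  "VBD", "DT",  "NN" };
        applyRule(words, tags, "they", "the", "DT", 5);
        System.out.println(String.join(" ", words));
    }
}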

29 Future work
Port the current framework (mostly AWK scripts and Unix tools) to Java, including bootstrapping the error corpus from the Wikipedia history dump
Add automatic translation from TBL rules to the LT formalism

30 THANK YOU

