1
LanguageTool - Part A David Ling
2
LanguageTool
- LanguageTool: open-source Java program
- Language_check: Python wrapper of LanguageTool; supports only up to v3.5 (current version: v3.9)
- To use it, either double-click 'languagetool.jar' or run it as a localhost HTTP server from the command line
Main papers:
- Daniel Naber, A Rule-Based Style and Grammar Checker, Diploma Thesis, University of Bielefeld, 2003
- Marcin Miłkowski, Developing an open-source, rule-based proofreading tool, Software – Practice and Experience 2010, 40 (7), pp DOI: /spe.971
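As a rough illustration of the local-server mode, the sketch below builds a check request in Python. The port 8081, the `/v2/check` path, and the `language`/`text` form fields are assumptions based on LanguageTool's public HTTP API and should be verified against the installed version; `build_check_request` and `check_text` are hypothetical helper names, not part of LanguageTool.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of a locally started LanguageTool server (not from the slides).
SERVER_URL = "http://localhost:8081/v2/check"


def build_check_request(text, language="en-US", url=SERVER_URL):
    """Build a POST request carrying the text as form-encoded data."""
    body = urllib.parse.urlencode({"language": language, "text": text})
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")


def check_text(text, language="en-US", url=SERVER_URL):
    """Send the text to the server and return (message, replacements) pairs."""
    with urllib.request.urlopen(build_check_request(text, language, url)) as resp:
        result = json.load(resp)
    return [
        (m["message"], [r["value"] for r in m.get("replacements", [])])
        for m in result.get("matches", [])
    ]
```

With a server running, `check_text("It can by found.")` would return one entry per rule match; the exact messages depend on the LanguageTool version.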
3
Rules in LanguageTool
XML rules
- grammar.xml (collaborative)
Java rules
- Rules that cannot be handled by XML rules (e.g. a missing closing parenthesis, a space after a comma)
- Spell checking
- n-gram frequency for potential homophones (like there - their)
- There are only a few Java rules (according to Marcin's paper in 2010)
XML rules use the following input features:
- word token
- part of speech of the token: postag (from dictionary)
- chunk tag of the token (by OpenNLP)
4
XML rules
No.  Category                                                        Number of rules
1    Possible typo                                                   506
2    Grammar                                                         405
3    Collocations                                                    9
4    Miscellaneous                                                   21
5    Punctuation Errors                                              48
6    Commonly Confused Words                                         241
7    Nonstandard Phrases
8    Redundant Phrases                                               159
9    Style                                                           17
10   Semantic                                                        13
11   Plain English (default: off)                                    92
12   Wikipedia (default: off)
13   Typography
14   Misused terms in EU publications, Gardner (default: off)        149
     Total                                                           1704
5
XML rules – possible typo
Rule name = "'as follow' (as follows)"
- Pattern: as follow [\.:,—\-–]
- Suggests "as follows"
Rule name = "'by' + passive participle (be)"
- Pattern: postag = "MD", then by, then postag = "JJ.?|VBN" except postag = "DT"
- Suggests "be"
- Example: This can by consistent with… → This can be consistent with…
- Example: It can by found. → It can be found.
Notes:
- MD: modal words
- JJ.?: adjective
- VBN: verb, past participle
- DT: determiner: a, an, all, …
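The 'by' + passive participle rule can be mimicked outside LanguageTool with a small token matcher. This is only a sketch under assumptions: each token is a hand-tagged `(word, set of postags)` pair, the "except DT" clause is read as "skip the token if any of its readings is DT", and `find_by_for_be` is a hypothetical name.

```python
import re


def find_by_for_be(tagged):
    """Return indices of 'by' tokens that the rule would change to 'be'.

    tagged: list of (word, set_of_postags); tags are supplied by the caller.
    """
    hits = []
    for i in range(1, len(tagged) - 1):
        prev_tags = tagged[i - 1][1]
        word = tagged[i][0]
        next_tags = tagged[i + 1][1]
        follows_modal = "MD" in prev_tags
        # Next token must have an adjective or past-participle reading ...
        next_ok = any(re.fullmatch(r"JJ.?|VBN", t) for t in next_tags)
        # ... and must not also be readable as a determiner (assumed meaning).
        next_excluded = "DT" in next_tags
        if follows_modal and word.lower() == "by" and next_ok and not next_excluded:
            hits.append(i)
    return hits


sentence = [
    ("It", {"PRP"}), ("can", {"MD"}), ("by", {"IN"}),
    ("found", {"VBN", "VBD"}), (".", {"."}),
]
print(find_by_for_be(sentence))
```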
6
XML rules – possible typo
Rule name = "miss use (misuse)"
- Pattern: miss, then understand|spell|use|place|lead|…|dial (inflected, postag = "VB[DNPZ]?")
- Suggests "mis" + token
- Example: These words are miss used. → These words are misused.
Notes:
- VB[DNPZ]?: verb, inflected: use, uses, used, …
Other randomly selected rules:
- land lover (landlubber): <correction="landlubber"> The sailors considered John to be a serious land lover.
- I/you/... thing (think): <correction="think|thinks"> I thing that's a good idea.
- to get ride (rid) of: <correction="rid"> Let's get ride of that broken chair.
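A minimal sketch of how the "miss use (misuse)" rule combines a lemma list with a postag regex. Tokens are assumed to arrive as hand-built `(word, lemma, postag)` triples, `STEMS` holds only the verbs visible on the slide (the full list is elided there), and `fix_miss` is a hypothetical name.

```python
import re

# Only the stems shown on the slide; the real rule lists more.
STEMS = ("understand", "spell", "use", "place", "lead", "dial")


def fix_miss(tokens):
    """Merge 'miss' + inflected verb into 'mis' + verb form.

    tokens: list of (word, lemma, postag); returns the corrected word list.
    """
    out = []
    i = 0
    while i < len(tokens):
        word = tokens[i][0]
        if word.lower() == "miss" and i + 1 < len(tokens):
            nxt_word, nxt_lemma, nxt_tag = tokens[i + 1]
            if nxt_lemma in STEMS and re.fullmatch(r"VB[DNPZ]?", nxt_tag):
                out.append("mis" + nxt_word)  # suggestion: "mis" + token
                i += 2
                continue
        out.append(word)
        i += 1
    return out


tokens = [
    ("These", "these", "DT"), ("words", "word", "NNS"), ("are", "be", "VBP"),
    ("miss", "miss", "VB"), ("used", "use", "VBN"), (".", ".", "."),
]
print(" ".join(fix_miss(tokens)))
```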
7
XML rules - Grammar
Rule name = "will follows be ('he is would')"
- Pattern: postag = "W(RB|P)", then be (inflected), then will|must (inflected)
- Message: redundant
- Example: How is would this approach be useful? → How is this … or How would this…
Rule name = "missing verb after 'if there'"
- Pattern: if <exception scope="previous">as</exception>, then there, then a token with <exception postag="VB.*|MD" /> and <exception>[´`'’]</exception>
- Message: missing verb
- Example: If there one who has … → If there is one who has …
Notes:
- WP: wh-pronoun: that, whatever, what, …
- WRB: wh-adverb: however, how, …
- VB.*: verb
- MD: modal words
- inflected: be, is, am, are
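The exceptions in the "missing verb after 'if there'" rule are easier to follow as explicit checks. The sketch below is one reading of that rule, not LanguageTool's matcher: tokens are hand-tagged `(word, set of postags)` pairs and `missing_verb_after_if_there` is a hypothetical name.

```python
import re


def missing_verb_after_if_there(tokens):
    """Return indices of 'if' where 'if there' is not followed by a verb.

    tokens: list of (word, set_of_postags).
    """
    hits = []
    for i in range(len(tokens) - 2):
        if tokens[i][0].lower() != "if" or tokens[i + 1][0].lower() != "there":
            continue
        # Exception with scope "previous": skip "as if there ...".
        if i > 0 and tokens[i - 1][0].lower() == "as":
            continue
        nxt_word, nxt_tags = tokens[i + 2]
        # Exception: the following token already is a verb or a modal.
        if any(re.fullmatch(r"VB.*|MD", t) for t in nxt_tags):
            continue
        # Exception: an apostrophe follows (e.g. "there's" split by the tokenizer).
        if re.fullmatch(r"[´`'’]", nxt_word):
            continue
        hits.append(i)
    return hits


bad = [("If", {"IN"}), ("there", {"EX"}), ("one", {"CD", "NN"}),
       ("who", {"WP"}), ("has", {"VBZ"})]
good = [("If", {"IN"}), ("there", {"EX"}), ("is", {"VBZ"}), ("one", {"CD", "NN"})]
print(missing_verb_after_if_there(bad), missing_verb_after_if_there(good))
```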
8
Randomly selected XML rules in Grammar
- some faculty... (some faculty members...): <correction="faculty members"> Three faculty support the change.
- all/most/some (of) + noun: <correction="All students|All of the students"> All of students like mathematics.
- both... as well as (and): <correction="and"> He is both very rich as well as handsome.
- Use of past form with 'going to ...': <correction="write"> I'm going to wrote him.
- Who + verb (who know's/knows): <correction="Who cares"> Who care's?
- inspired with (by): <correction="inspired by"> The artist was inspired with the beauty of the mountains.
- beware PREPOSITION: <correction="Beware of"> Beware about malware.
- objective case after with(out)/at/to/...: <correction="to me|to her|to him|to us|to them"> Give it to I.
9
XML rules – commonly confused words
Rule name = "and than (then)"
- Pattern: and|since than
- Suggests: then
Rule name = "rather/other/different then (than)"
- Pattern: rather|other then
- Suggests: than
Other rule names:
- turned of (off)
- 'economical (economic) growth' etc.
- in the passed (in the past)
- too go (to go)
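The two then/than patterns are plain word sequences, so they can be approximated with regular expressions. This is a simplified stand-in for the XML rules (no tokenizer, no exceptions); `RULES` and `fix_then_than` are hypothetical names.

```python
import re

# (pattern, replacement) pairs mirroring the two rules on this slide.
RULES = [
    (re.compile(r"\b(and|since) than\b", re.IGNORECASE), r"\1 then"),
    (re.compile(r"\b(rather|other) then\b", re.IGNORECASE), r"\1 than"),
]


def fix_then_than(sentence):
    """Apply each rule in turn and return the corrected sentence."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence


print(fix_then_than("He ate and than left."))
print(fix_then_than("I would rather walk then drive, other then on Sundays."))
```

Note that only the adjacent pairs are touched: "walk then drive" in the second sentence is left alone because the slide's pattern requires "rather" or "other" directly before "then".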
10
XML rules – redundant phrases & punctuation
Redundant phrases
- absolutely essential/necessary (essential/necessary): <correction="essential"> This is absolutely essential.
- established fact (fact): <correction="a fact"> This is an established fact.
- there are also other (also): <correction="there are other|there are also"> However, there are also other marbles in the jar.
Punctuation
- extraneous apostrophes before 'are': <correction="cars"> The car's are cheap.
- Comma after a month: <correction="October 1958"> The store closed its doors for good in October, 1958.
- Missing comma between day of month and year: <correction="October 18,"> My birthday is October
11
N-gram data rule
Resolves confusable word pairs, like their and there
- Given a confusion list (currently ~600 pairs), e.g. (their, there; adapting, adopting)
- Input sentence: This is there last chance to escape.
- The system compares the 3-gram frequencies for 'there' with those for 'their':
  - This is there, is there last, there last chance
  - This is their, is their last, their last chance
- Recommends 'their' if the probability ratio is greater than a threshold
Remarks:
- n-gram data is from the Google Books Ngram Viewer
- Someone is developing a word2vec approach to calculate the probability instead of the 3-gram (context: {this, is, last, chance}, guessing {there, their})
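The 3-gram comparison can be sketched as follows. The slide does not say how the three frequencies are combined, so multiplying add-one-smoothed counts is an assumption here, as are the threshold of 10, the helper names, and the toy counts (made up for illustration, not Google n-gram data).

```python
def trigrams_around(tokens, i):
    """Return the (up to three) 3-grams that contain position i."""
    grams = []
    for start in range(i - 2, i + 1):
        if start >= 0 and start + 3 <= len(tokens):
            grams.append(tuple(tokens[start:start + 3]))
    return grams


def score(tokens, i, counts, smoothing=1.0):
    """Product of smoothed counts of the 3-grams around position i (assumed scoring)."""
    product = 1.0
    for gram in trigrams_around(tokens, i):
        product *= counts.get(gram, 0) + smoothing
    return product


def suggest(tokens, i, alternative, counts, min_ratio=10.0):
    """Return the alternative word if it is sufficiently more likely, else None."""
    swapped = tokens[:i] + [alternative] + tokens[i + 1:]
    ratio = score(swapped, i, counts) / score(tokens, i, counts)
    return alternative if ratio > min_ratio else None


# Made-up counts, for illustration only.
counts = {
    ("this", "is", "there"): 400, ("is", "there", "last"): 1,
    ("this", "is", "their"): 50, ("is", "their", "last"): 200,
    ("their", "last", "chance"): 300,
}
tokens = "this is there last chance to escape .".split()
print(trigrams_around(tokens, 2))
print(suggest(tokens, 2, "their", counts))
```

A real implementation would look the counts up in an n-gram index and would probably normalise them into probabilities; the structure of the comparison stays the same.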
12
Next time
- other XML rules
- spell check
- chunking by OpenNLP
references: