Robust Error Detection: A Hybrid Approach Combining Unsupervised Error Detection and Linguistic Knowledge Johnny Bigert and Ola Knutsson Royal Institute of Technology Stockholm, Sweden

- Detection of context-sensitive spelling errors
- Identification of less frequent grammatical constructions in the face of sparse data
- Hybrid method:
  - Unsupervised error detection
  - Linguistic knowledge used for phrase transformations

Properties
- Finds difficult error types in unrestricted text (e.g. spelling errors resulting in an existing word)
- No prior knowledge required, i.e. no classification of errors or confusion sets

A first approach
Algorithm:
  for each position i in the tag stream:
    if the frequency of the trigram (t[i-1], t[i], t[i+1]) is low:
      report an error to the user
    else:
      report no error
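A minimal sketch of this first approach, assuming the corpus is given as a list of POS-tag sequences; the function names and the frequency threshold are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter

def trigram_frequencies(tagged_corpus):
    """Count POS-tag trigrams over a corpus given as a list of tag sequences."""
    freqs = Counter()
    for tags in tagged_corpus:
        for i in range(1, len(tags) - 1):
            freqs[(tags[i - 1], tags[i], tags[i + 1])] += 1
    return freqs

def detect_errors(tags, freqs, threshold=1):
    """Flag positions whose surrounding tag trigram is rare."""
    for i in range(1, len(tags) - 1):
        if freqs[(tags[i - 1], tags[i], tags[i + 1])] < threshold:
            print(f"possible error at position {i}: {tags[i - 1:i + 2]}")
```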

Sparse data
Problems:
- Data sparseness for trigram statistics
- Phrase and clause boundaries may produce almost any trigram

Sparse data
Example: ”It is every manager's task to…”
- ”It is every” is tagged (pn.neu.sin.def.sub/obj, vb.prs.akt, dt.utr/neu.sin.ind) and has a frequency of zero
- Probable cause: out of a million words in the corpus, only 709 have been assigned the tag dt.utr/neu.sin.ind

Sparse data
We try to replace ”It is every manager's task to…” with ”It is a manager's task to…”

Sparse data
- ”It is every” is tagged (pn.neu.sin.def.sub/obj, vb.prs.akt, dt.utr/neu.sin.ind) and has a frequency of 0
- ”It is a” is tagged (pn.neu.sin.def.sub/obj, vb.prs.akt, dt.utr.sin.ind) and has a frequency of 231
- dt.utr/neu.sin.ind has a frequency of 709
- dt.utr.sin.ind has a frequency of 19112

Tag replacements
When replacing a tag:
- Not all tags are suitable as replacements
- Not all replacements are equally appropriate…
- …and thus, we require a penalty or probability for each replacement

Tag replacements
To be considered:
- Manual work is needed to create the probabilities for each tag set and language
- The probabilities are difficult to estimate manually
- Automatic estimation of the probabilities (described in another paper)

Tag replacements
Examples of replacement probabilities:
Mannen var glad. (The man was happy.)
Mannen är glad. (The man is happy.)
- 100% vb.prt.akt.kop → vb.prt.akt.kop
-  74% vb.prt.akt.kop → vb.prs.akt.kop
-  50% vb.prt.akt.kop → vb.prt.akt____
-  48% vb.prt.akt.kop → vb.prt.sfo

Tag replacements
Examples of replacement probabilities:
Mannen talar med de anställda. (The man talks to the employees.)
Mannen talar med våra anställda. (The man talks to our employees.)
- 100% dt.utr/neu.plu.def → dt.utr/neu.plu.def
-  44% dt.utr/neu.plu.def → dt.utr/neu.plu.ind/def
-  42% dt.utr/neu.plu.def → ps.utr/neu.plu.def
-  41% dt.utr/neu.plu.def → jj.pos.utr/neu.plu.ind.nom
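One way such replacement tables could be represented is a nested dictionary keyed by the original tag; the layout below is an illustrative assumption, with the penalty values taken from the slides:

```python
# Replacement penalty table: original tag -> {replacement tag: penalty}.
# The values mirror the two slides above; the data structure itself is
# an illustration, not the authors' implementation.
REPLACEMENTS = {
    "vb.prt.akt.kop": {
        "vb.prt.akt.kop": 1.00,   # identity replacement
        "vb.prs.akt.kop": 0.74,
        "vb.prt.akt____": 0.50,   # tag as written on the slide
        "vb.prt.sfo": 0.48,
    },
    "dt.utr/neu.plu.def": {
        "dt.utr/neu.plu.def": 1.00,
        "dt.utr/neu.plu.ind/def": 0.44,
        "ps.utr/neu.plu.def": 0.42,
        "jj.pos.utr/neu.plu.ind.nom": 0.41,
    },
}
```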

Weighted trigrams
Replacing (t1 t2 t3) with (r1 r2 r3):
  f = freq(r1 r2 r3) · penalty
  penalty = Pr[replace t1 with r1] · Pr[replace t2 with r2] · Pr[replace t3 with r3]

Weighted trigrams
Replacement of tags:
- Calculate f for all representatives of t1, t2 and t3 (typically 3 · 3 · 3 of them)
- The weighted frequency is the sum of the penalized frequencies, as sketched below
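A sketch of the weighted frequency computation under the same assumptions as before (freqs is the trigram Counter, replacements a penalty table as above; tags without an entry fall back to themselves with penalty 1.0):

```python
from itertools import product

def weighted_freq(trigram, freqs, replacements):
    """Weighted frequency of a tag trigram: the sum of freq(r1 r2 r3) · penalty
    over all combinations of replacement tags (typically 3 · 3 · 3 of them)."""
    # Each tag without a replacement table falls back to itself, penalty 1.0.
    candidates = [replacements.get(t, {t: 1.0}) for t in trigram]
    total = 0.0
    for r1, r2, r3 in product(*candidates):
        penalty = candidates[0][r1] * candidates[1][r2] * candidates[2][r3]
        total += freqs[(r1, r2, r3)] * penalty
    return total
```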

Algorithm
  for each position i in the tag stream:
    if the weighted frequency of (t[i-1], t[i], t[i+1]) is low:
      report an error to the user
    else:
      report no error
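The detection loop with weighted frequencies, reusing weighted_freq from the sketch above; the threshold value is again an illustrative assumption:

```python
def detect_errors_weighted(tags, freqs, replacements, threshold=1.0):
    """Flag positions whose weighted trigram frequency falls below a threshold."""
    for i in range(1, len(tags) - 1):
        trigram = (tags[i - 1], tags[i], tags[i + 1])
        if weighted_freq(trigram, freqs, replacements) < threshold:
            print(f"possible error at position {i}: {trigram}")
```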

An improved algorithm
- Problems with sparse data: phrase and clause boundaries may produce almost any trigram
- Use clauses as the unit for error detection to avoid clause boundaries

Phrase transformations
We identify phrases to transform rare constructions into more frequent ones (see the sketch below):
- Replacing the phrase with its head
- Removing phrases (e.g. AdvP, PP)
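A rough sketch of these two transformations, assuming a shallow parser that returns phrases as (label, start, end, head_index) tuples over the token list; this representation is an assumption, and morphological adjustment of the kept head (e.g. hundar → hundarna in the next slide's example) is omitted:

```python
def transform(tokens, phrases, removable=("AdvP", "PP")):
    """Replace NPs by their head token and delete removable phrase types."""
    keep = set(range(len(tokens)))
    for label, start, end, head in phrases:
        span = set(range(start, end))
        if label == "NP":
            keep -= span - {head}   # keep only the head of the NP
        elif label in removable:
            keep -= span            # drop the whole phrase
    return [tokens[i] for i in sorted(keep)]

# transform("Alla hundar som är bruna är lyckliga".split(), [("NP", 0, 5, 1)])
# -> ['hundar', 'är', 'lyckliga']   (before morphological adjustment)
```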

Phrase transformations
Example:
[Alla hundar som är bruna]NP är lyckliga (All dogs that are brown are happy)
→ Hundarna är lyckliga (The dogs are happy)

Phrase transformations
Den bruna (jj.sin) hunden (the brown dog)
De bruna (jj.plu) hundarna (the brown dogs)

Phrase transformations
The same example with a tagging error:
[Alla hundar som är bruna (jj.sin)]NP är lyckliga (All dogs that are brown are happy)
Robust NP detection yields: Hundarna är lyckliga (The dogs are happy)

Results
Error types found:
- context-sensitive spelling errors
- split compounds
- spelling errors
- verb chain errors

Comparison between probabilistic methods
- The unsupervised method has good error detection capacity but also a high rate of false alarms
- The introduction of linguistic knowledge dramatically reduces the number of false alarms

Future work
- The error detection method is not restricted to part-of-speech tags; we consider adapting it to phrase n-grams
- Error classification
- Generation of correction suggestions

Summing up
- Detection of context-sensitive spelling errors
- Combining an unsupervised error detection method with robust shallow parsing

Internal Evaluation
- POS tagger: 96.4%
- NP recognition: P = 83.1% and R = 79.5%
- Clause boundary recognition: P = 81.4% and R = 86.6%
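For reference, the standard precision/recall definitions behind the P and R figures above (not specific to this work):

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard definitions: P = tp / (tp + fp), R = tp / (tp + fn)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
```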