Automated Compounding as a means for Maximizing Lexical Coverage Vincent Vandeghinste Centrum voor Computerlinguïstiek K.U. Leuven.

Maximizing Lexical Coverage
Target: reduction of the number of OOV words
Means:
– accurate content and organization of the recognizer lexicon
– taking care of a number of productive word formation processes
Evaluation:
– implementation of a test tool
– test results
Conclusions

Lexicon: Content & Organization
Starting point: CGN lexicon ( entries)
Reduction to one entry per wordform per POS ( entries)
Removal of compounds ( entries)
Selection of the most frequent entries (40.000) => Basic Word List (BWL)
Quasi-Word List (QWL): compounding word parts which do not appear in the BWL
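A minimal sketch of how such a BWL/QWL split might be built, assuming the lexicon is available as (wordform, POS, frequency, is_compound, parts) tuples; the function and field names are illustrative, not the actual CGN tooling:

```python
# Illustrative sketch (not the actual CGN tooling): build a Basic Word List (BWL)
# of frequent non-compound entries and a Quasi-Word List (QWL) of compound parts
# that do not occur as independent BWL entries.

def build_word_lists(lexicon, bwl_size=40000):
    """lexicon: iterable of (wordform, pos, frequency, is_compound, parts)."""
    # One entry per (wordform, POS): keep the highest-frequency reading.
    unique = {}
    for wordform, pos, freq, is_compound, parts in lexicon:
        key = (wordform, pos)
        if key not in unique or freq > unique[key][0]:
            unique[key] = (freq, is_compound, parts)

    # Drop compounds, then keep the most frequent remaining entries as the BWL.
    non_compounds = [(freq, wf, pos) for (wf, pos), (freq, is_comp, _) in unique.items()
                     if not is_comp]
    non_compounds.sort(reverse=True)
    bwl = {(wf, pos) for _, wf, pos in non_compounds[:bwl_size]}
    bwl_forms = {wf for wf, _ in bwl}

    # QWL: compounding word parts that never appear in the BWL on their own.
    qwl = set()
    for (wf, pos), (freq, is_comp, parts) in unique.items():
        if is_comp and parts:
            qwl.update(p for p in parts if p not in bwl_forms)
    return bwl, qwl
```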

Lexicon Accuracy
Careful selection of the words in the BWL:
– no compounds
– frequent words
Organization of the lexicon: maximal applicability of the compounding rules through the split of the lexicon into BWL and QWL

Word Formation Processes
Input: a number of word parts that may or may not form a compound
Hybrid approach: rule-based + statistical filters
Output:
– compound + morpho-syntactic info + confidence measure, or
– no compounding possible with the given word parts

Word Formation Processes: Input
From the BWL: full words that can be part of a compound or stand on their own
From the QWL: ‘words’ that can only occur as part of a compound
2 up to 5 word parts

Word Formation Processes: Rules
Making use of rules for word formation, e.g. modifier (N) + head (N) => compound (N)
Input from the QWL: the word part is an N and can only be a modifier
Input from the BWL: the word is looked up in the CGN lexicon; its morpho-syntactic info is used in the rules
Rules combine 2 word parts
When the input has more than 2 word parts, the rules are applied recursively (see the sketch below)
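A rough illustration of how such recursive rule application could look; only the noun-noun rule mentioned above is included, and the data layout and function names are assumptions, not the actual implementation:

```python
# Hypothetical sketch of recursive rule-based compounding.
# Only the modifier (N) + head (N) => compound (N) rule from the slide is shown;
# the real system uses CGN morpho-syntactic information and more rules.

def combine(modifier, head):
    """Apply the noun-noun compounding rule to two word parts, if possible."""
    if modifier.get("pos") == "N" and head.get("pos") == "N":
        return {"form": modifier["form"] + head["form"], "pos": "N"}
    return None  # no rule applies to this pair

def compound(parts):
    """Recursively reduce a list of 2..5 word parts to a single compound."""
    if len(parts) == 1:
        return parts[0]
    # Combine the first two parts, then recurse on the result plus the rest.
    first = combine(parts[0], parts[1])
    if first is None:
        return None  # no compounding possible with the given word parts
    return compound([first] + parts[2:])

# Example: "frequentie" + "tabel" => "frequentietabel"
print(compound([{"form": "frequentie", "pos": "N"}, {"form": "tabel", "pos": "N"}]))
```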

Word Formation Processes: Statistics
– Relative Frequency Threshold parameter
– Confidence Measure of the Compound Probability

Relative Frequency Threshold
Makes use of the relative frequency of a POS for a word form
Makes use of a threshold value (0.05%)
If RF > threshold: the POS is used for this wordform
If RF < threshold: the POS is rejected for this wordform
Example: RF(bij (PREP)) > T and RF(bij (N)) < T, so only bij (PREP) is used
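A minimal sketch of this filter, assuming POS frequencies per wordform are available as a dictionary; the names, data layout, and example counts are illustrative:

```python
# Hypothetical sketch of the relative-frequency threshold filter.
# pos_freqs maps a wordform to {POS: frequency in the corpus}.

RF_THRESHOLD = 0.0005  # 0.05%, as on the slide

def allowed_pos(wordform, pos_freqs, threshold=RF_THRESHOLD):
    """Return the POS tags whose relative frequency for this wordform exceeds the threshold."""
    freqs = pos_freqs.get(wordform, {})
    total = sum(freqs.values())
    if total == 0:
        return []
    return [pos for pos, f in freqs.items() if f / total > threshold]

# Illustrative (made-up) counts for "bij": mostly a preposition, rarely a noun ("bee").
example = {"bij": {"PREP": 99990, "N": 10}}
print(allowed_pos("bij", example))  # -> ['PREP'] with these made-up counts
```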

Confidence Measure of Compounding
Probability estimation of:
P(comp(w1=mod, w2=head)) / P(comp(w1=*, w2=head))
where:
– P(comp(w1=mod, w2=head)) is the probability that the two consecutive word parts form a compound rather than being 2 separate words
– P(comp(w1=*, w2=head)) is the probability of w2 being a head, with any modifier

Confidence Measure of Compounding Probability (2)
If the compound is found in the frequency list, the ratio is estimated as:
[Fr(comp(w1=mod, w2=head)) / Fr(comp(w1=*, w2=head))] x (1 - D_head)
where:
– Fr(comp(w1=mod, w2=head)) is the frequency of the compound that consists of w1 + w2
– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
– D_head is the discount parameter: the amount of probability reserved for words not in the frequency list

Confidence Measure of Compounding Probability (3)
The discount parameter is estimated as:
D_head = #diff(mod | head) / Fr(comp(w1=*, w2=head))
where:
– #diff(mod | head) is the number of different modifiers occurring with the given head
– Fr(comp(w1=*, w2=head)) is the frequency of the 2nd word part as a head, with any modifier
(1 - D_head) is the amount of probability reserved for words that can be found in the frequency list

Confidence Measure of Compounding Probability (4)
If the compound is not found in the frequency list, the ratio is estimated as:
D_head x [Fr(comp(w1=mod, w2=*)) / Fr(*)]
where:
– Fr(comp(w1=mod, w2=*)) is the frequency of the 1st word part as a modifier of any head
– Fr(*) is the total frequency of all words in the frequency list (= )

Confidence Measures: Examples
binnen + kijken
– binnenkijken occurs in the frequency list
– Fr(w1=binnen, w2=kijken) = 10
– Fr(w1=*, w2=kijken) = 2188
– #diff(mod | head=kijken) = 21
– (10 / 2188) x (1 - 21/2188) ≈ 0.0045
frequentie + tabel
– frequentietabel does not occur in the frequency list
– Fr(w1=*, w2=tabel) = 141
– #diff(mod | head=tabel) = 17
– Fr(w1=frequentie, w2=*) = 15
– (17 / 141) x (15 / ) = 2.26e-8
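A small sketch tying slides (2)-(4) together; the counts are taken from the binnen + kijken example above, while the function and parameter names are illustrative rather than the original implementation:

```python
# Hypothetical sketch of the confidence measure from the preceding slides.

def confidence(freq_compound, freq_head_any_mod, n_diff_modifiers,
               freq_mod_any_head=None, total_freq=None):
    """Estimate P(comp(mod, head)) / P(comp(*, head)).

    freq_compound      : Fr(comp(w1=mod, w2=head)), 0 if the compound is unseen
    freq_head_any_mod  : Fr(comp(w1=*,  w2=head))
    n_diff_modifiers   : #diff(mod | head)
    freq_mod_any_head  : Fr(comp(w1=mod, w2=*)), needed for unseen compounds
    total_freq         : Fr(*), total frequency of the list, needed for unseen compounds
    """
    d_head = n_diff_modifiers / freq_head_any_mod  # discount parameter, slide (3)
    if freq_compound > 0:
        # Compound is in the frequency list: slide (2).
        return (freq_compound / freq_head_any_mod) * (1 - d_head)
    # Compound is not in the frequency list: slide (4).
    return d_head * (freq_mod_any_head / total_freq)

# binnen + kijken, using the counts from the example slide:
print(confidence(freq_compound=10, freq_head_any_mod=2188, n_diff_modifiers=21))
# -> roughly 0.0045
```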

Evaluation
– Test System
– Test Results

The Test System
Takes a regular text as input
Converts punctuation marks into #
For the test system, a BWL of entries was used
Every word is checked against the BWL:
– if the word is not present in the BWL, it is split up into a modifier (QWL or BWL) and a head (BWL), as sketched below
– no compounding rules are used for the split-up procedure
– if no possible split-up is found, a split-up into 3 parts is tried
If a word cannot be found in the BWL and cannot be split up, it is classified as an OOV word
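A rough sketch of that split-up procedure; the function name, return convention, and lookup details are assumptions, and the real test tool may differ:

```python
# Hypothetical sketch of the split-up step used by the test system.
# bwl and qwl are sets of wordforms (Basic Word List and Quasi-Word List).

def split_word(word, bwl, qwl):
    """Try to split an unknown word into modifier + head (or 3 parts)."""
    if word in bwl:
        return [word]
    # Two-part split: modifier from QWL or BWL, head from BWL.
    for i in range(1, len(word)):
        mod, head = word[:i], word[i:]
        if (mod in qwl or mod in bwl) and head in bwl:
            return [mod, head]
    # Three-part split as a fallback.
    for i in range(1, len(word) - 1):
        for j in range(i + 1, len(word)):
            p1, p2, p3 = word[:i], word[i:j], word[j:]
            if ((p1 in qwl or p1 in bwl) and (p2 in qwl or p2 in bwl)
                    and p3 in bwl):
                return [p1, p2, p3]
    return None  # cannot be split: classified as an OOV word
```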

The Test System (2)
For every 2 consecutive word parts, it was tested whether they can be compounded or not
The results are compared with the original text
False compounding and false identification of non-compounds can be counted this way
The same was done for every 3 consecutive word parts
A threshold was set on the confidence measure: if the confidence measure < threshold, the compound is rejected
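A minimal sketch of how these counts could be accumulated, assuming a compoundability score like the confidence function above and a rejection threshold; all names and the data layout are illustrative:

```python
# Hypothetical sketch of the evaluation loop: compare predicted compoundability
# of consecutive word parts against the original text and count the errors.

def evaluate(pairs, is_compound_in_text, confidence_fn, threshold):
    """pairs: list of (part1, part2); is_compound_in_text: list of booleans
    saying whether the pair was written as one word in the original text."""
    false_compounds = 0      # predicted compound, but separate words in the text
    false_non_compounds = 0  # predicted separate words, but a compound in the text
    for (p1, p2), gold in zip(pairs, is_compound_in_text):
        predicted = confidence_fn(p1, p2) >= threshold
        if predicted and not gold:
            false_compounds += 1
        elif not predicted and gold:
            false_non_compounds += 1
    correct = len(pairs) - false_compounds - false_non_compounds
    return correct / len(pairs), false_compounds, false_non_compounds
```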

Test Results
3 test texts were used:
– Thuis (dialogue from a soap series): 3415 words, 3.08% OOV, 1.47% compounds
– Aspe (chapter of a novel): 4589 words, 3.77% OOV, 6.08% compounds
– Interview (transcript of spontaneous speech): 4645 words, 0.84% OOV, 2.95% compounds
Most of the OOVs are proper nouns or non-standard Dutch

Test Results (2)
Correct identification of non-compounds and compounds:
– dependent on the test text
– dependent on the parameter thresholds
There is a nearly perfect negative correlation (-0.98) between the optimal confidence threshold and the proportion of compounds in the test text

Test Results (3)

Conclusions
Identifying compoundability can be done with an accuracy of %
Lexical coverage can be assured with OOV rates between 0.8 and 3.8% and a lexicon with a total size of entries (BWL+QWL)

Conclusions (2)
Capturing already existing compounds by automated compounding proves to be successful
Capturing newly formed compounds proves to be a lot harder: the accuracy is a lot lower
Automated compounding proves to be a useful means for maximizing lexical coverage