Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis. Meni Adler, Yoav Goldberg, David Gabay, Michael Elhadad, Ben-Gurion University.


Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis Meni Adler, Yoav Goldberg, David Gabay, Michael Elhadad Ben-Gurion University ACL 2008, Columbus, Ohio

Unknown Words - English
The draje of the tagement starts rikking with Befa.
Morphology: unknown word → { analysis_1 prob_1, …, analysis_n prob_n } → Syntax consumes some analysis_i with prob_i
Motivation

Unknown Words - English
The draje of the tagement starts rikking with Befa.
- Morphology: tagement, rikking, Befa
- Syntax: The draje of the tagement starts rikking with Befa.
Motivation

Unknowns Resolution Method in English
Baseline method:
- PN tag for capitalized words
- Uniform distribution over open-class tags
Evaluation:
- 12 open-class tags
- 45% of unknowns are capitalized
- Overall, 70% of the unknowns were tagged correctly
Motivation
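The English baseline is simple enough to sketch in a few lines. This is a toy rendition (not the authors' code); the tag inventory below is illustrative, standing in for the 12 open-class tags mentioned on the slide.

```python
# Toy sketch of the English baseline for unknown words:
# capitalized unknowns get the proper-noun tag; otherwise probability
# mass is spread uniformly over the open-class tags.
OPEN_CLASS_TAGS = ["NN", "NNS", "JJ", "JJR", "RB", "VB",
                   "VBD", "VBG", "VBN", "VBP", "VBZ", "FW"]  # illustrative 12-tag set

def baseline_tag_distribution(token):
    """Return {tag: probability} for an unknown token."""
    if token[:1].isupper():
        return {"PN": 1.0}
    p = 1.0 / len(OPEN_CLASS_TAGS)
    return {tag: p for tag in OPEN_CLASS_TAGS}
```

As the next slide notes, this heuristic collapses in Hebrew, where capitalization does not exist and the open-class tag set is far larger.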

Unknowns Resolution Method in Hebrew
The baseline method resolves only 5% of the Hebrew unknown tokens!
Why? How can we improve?
Motivation

Unknown Words - Hebrew
The draje of the tagement starts rikking with Befa.
עם בפה דרג הטגמנט התחיל לנפן
drj htgmnt hthil lnpn `m bfh
Motivation

Unknown Words - Hebrew
drj htgmnt hthil lnpn `m bfh
Morphology:
- No capitalization: PN is always a candidate
- Many open-class tags (> 3,000)
Syntax:
- Unmarked function words: preposition, definite article, conjunction, pronoun
  e.g. "the drj of" → drj
- Function-word ambiguity:
  htgmnt: VBP/VBI, DEF+MM
  `m: PREP (with), NN (people)
  bfh: PREP+NN/PN/JJ/VB, PREP+DEF+NN/JJ…
Motivation

Outline
- Characteristics of Hebrew Unknowns
- Previous Work
- Unsupervised Lexicon-based Approaches: Letters Model, Pattern Model, Linear-context Model
- Evaluation
- Conclusion

Hebrew Text Analysis System
Pipeline: text → Tokenizer → tokenized text → Morphological Analyzer (Lexicon) → known words analysis distribution; Unknown Tokens Analyzer (ME) → unknown words analysis distribution → Morphological Disambiguator (HMM) → disambiguated text → Proper-name Classifier (SVM) → disambiguated text with PN
Downstream components: Named-Entity Recognizer (SVM), Noun-Phrase Chunker
Hebrew Unknowns Characteristics

Hebrew Unknowns
Unknown tokens: tokens which are not recognized by the lexicon
- NN: פרזנטור (presenter)
- VB: התחרדן (got warm under the sun)
Unknown analyses: the set of analyses suggested by the lexicon does not contain the correct analysis for a given token
- PN: שמעון פרס (Shimon Peres; segmentable as "that a dorm cut…")
- RB: לעומתי (oppositional; "compared with me")
Hebrew Unknowns Characteristics

Hebrew Unknowns - Evidence
Unknown tokens (4%):
- Only 50% of the unknown tokens are PN → selecting a default PN POS is not sufficient
- More than 30% of the unknown tokens are neologisms → neologism detection is needed
Unknown analyses (3.5%):
- 60% of the unknown analyses are proper names
- Other POS cover 15% of the unknowns (only 1.1% of the tokens)
- A PN classifier is sufficient for unknown analyses
Hebrew Unknowns Characteristics

Hebrew Unknown Tokens Analysis
Objective: given an unknown token, extract all possible morphological analyses, and assign a likelihood to each analysis.
Example: התחרדן (got warm in the sun)
- verb.singular.masculine.third.past 0.6
- proper noun 0.2
- noun.def.singular.masculine 0.1
- noun.singular.masculine.absolute 0.05
- noun.singular.masculine.construct …
Hebrew Unknowns Characteristics

Previous Work - English
- Tag-specific heuristics; spelling features: capitalization, hyphens, suffixes [Weischedel et al. 95]
- Guessing rules learned from raw text [Mikheev 97]
- HMM with tag-suffix transitions [Thede & Harper 99]
Previous Work

Previous Work - Arabic
- Root-pattern features for morphological analysis and generation of Arabic dialects [Habash & Rambow 06]
- Combination of a lexicon-based and a character-based tagger [Mansour et al. 07]
Previous Work

Our Approach
Resources:
- A large amount of unlabeled data (unsupervised)
- A comprehensive lexicon (lexicon-based)
Hypothesis: the characteristics of unknown tokens are similar to those of known tokens.
Method: tag-distribution models, based on the morphological analysis of the known tokens in the corpus:
- Letters model
- Pattern model
- Linear-context model
Unsupervised Lexicon-based Approach

Notation
- Token: a sequence of characters bounded by spaces, e.g. בצל bcl
- Prefixes: the prefixes according to each analysis, e.g. for preposition+noun (under a shadow): ב b
- Base-word: the token without prefixes, for each analysis, e.g. noun (an onion): בצל bcl; preposition+noun (under a shadow): צל cl
Unsupervised Lexicon-based Approach
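Enumerating the (prefix, base-word) splits of a token can be sketched as follows. This is a minimal illustration assuming a toy set of prefix particles; the real lexicon licenses far more prefix sequences and also constrains which base-words each prefix may attach to.

```python
# Hypothetical sketch: split a transliterated token into (prefix, base-word)
# pairs licensed by a small, illustrative set of prefix particles.
VALID_PREFIXES = {"", "b", "h", "w", "l", "m", "k", "wh", "bh"}  # toy inventory

def segmentations(token, max_prefix_len=2):
    """Yield (prefix, base_word) pairs with a non-empty base-word."""
    for i in range(min(max_prefix_len, len(token) - 1) + 1):
        prefix, base = token[:i], token[i:]
        if prefix in VALID_PREFIXES:
            yield prefix, base
```

For the slide's example, `segmentations("bcl")` yields the two analyses of בצל: the bare noun ("", "bcl") and preposition plus noun ("b", "cl").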

Letters Model
For each possible analysis of a given token:
- Features:
  - Positioned uni-, bi- and trigram letters of the base-word
  - The prefixes of the base-word
  - The length of the base-word
- Value: a full set of the morphological properties (as given by the lexicon)
Training: raw-text corpus + lexicon → ME → Letters Model
Unsupervised Lexicon-based Approach

Letters Model – An example
Known token: בצל bcl
Analysis 1: an onion; base-word: bcl
- Grams: b:1 c:2 l:3 b:-3 c:-2 l:-1 bc:1 cl:2 bc:-2 cl:-1 bcl:1 bcl:-1
- Prefix: none
- Length of base-word: 3
- Value: noun.singular.masculine.absolute
Analysis 2: under a shadow; base-word: cl
- Grams: c:1 l:2 c:-2 l:-1 cl:1 cl:-1
- Prefix: b
- Length of base-word: 2
- Value: preposition+noun.singular.masculine.absolute
Unsupervised Lexicon-based Approach
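The positioned grams can be generated mechanically. The sketch below is my own formulation, following the position convention of the bcl example (positive positions count from the start of the base-word, negative positions from its end, so the last letter is always -1):

```python
def letters_features(base_word, prefix):
    """Positioned uni/bi/trigram features of the base-word,
    plus the prefix and the base-word length."""
    feats = set()
    L = len(base_word)
    for n in (1, 2, 3):
        for i in range(L - n + 1):
            gram = base_word[i:i + n]
            feats.add("%s:%d" % (gram, i + 1))          # position from the start
            feats.add("%s:%d" % (gram, i + n - 1 - L))  # position from the end
    feats.add("prefix=%s" % (prefix or "none"))
    feats.add("len=%d" % L)
    return feats
```

`letters_features("bcl", "")` reproduces the gram set of the first analysis above, plus `prefix=none` and `len=3`.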

Pattern Model
Word formation in Hebrew is based on root+template and/or affixation. Based on [Nir 93], we defined 40 common neologism formation patterns, e.g.
- Verb template: miCCeC מחזר, tiCCeC תארך
- Noun suffixation: ut שיפוטיות, iya בידוריה; template: tiCCoCet תפרוסת, miCCaCa מגננה
- Adjective suffixation: ali סטודנטיאלי, oni טלויזיוני
- Adverb suffixation: it לעומתית
Unsupervised Lexicon-based Approach
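Template patterns of this kind can be compiled into regular expressions. The sketch below is a toy illustration over Latin transliterations (not the authors' implementation): in a template, "C" is a root-consonant slot, and vowel letters are dropped because the Hebrew script is unvocalized; the pattern inventory shown is a small illustrative subset of the 40.

```python
import re

CONS = "[bcdfghjklmnpqrstvwxyz]"  # consonant wildcard for a "C" slot

def template_to_regex(template):
    """Compile a formation template (e.g. 'miCCeC') into a surface regex."""
    parts = []
    for ch in template:
        if ch == "C":
            parts.append(CONS)
        elif ch in "aeiou":
            continue  # vowels are not written in unvocalized script
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$")

PATTERNS = {
    "verb_miCCeC": template_to_regex("miCCeC"),
    "verb_tiCCeC": template_to_regex("tiCCeC"),
    "noun_tiCCoCet": template_to_regex("tiCCoCet"),
    "noun_suffix_ut": re.compile(".+ut$"),
    "adj_suffix_oni": re.compile(".+oni$"),
}

def pattern_features(token):
    """One binary feature per pattern, plus a 'no pattern' feature."""
    feats = {name: int(rx.match(token) is not None) for name, rx in PATTERNS.items()}
    feats["no_pattern"] = int(not any(feats.values()))
    return feats
```

For example, a transliterated miCCeC verb such as "mhzr" fires `verb_miCCeC`, while a token matching no pattern fires only `no_pattern`.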

Pattern Model
For each possible analysis of a given token:
- Features: for each pattern, 1 if the token fits the pattern, 0 otherwise; plus a 'no pattern' feature
- Value: a full set of the morphological properties (as given by the lexicon)
Training: raw-text corpus + lexicon + patterns → ME → Pattern Model
Unsupervised Lexicon-based Approach

Letters+Pattern Model
For each possible analysis of a given token:
- Features: letters features + pattern features
- Value: a full set of the morphological properties (as given by the lexicon)
Training: raw-text corpus + lexicon + patterns → ME → Letters + Pattern Model
Unsupervised Lexicon-based Approach

Linear-context Model
The draje of the tagement starts rikking with Befa.
P(t|w) is hard to estimate for unknown tokens: P(noun|draje), P(adjective|draje), P(verb|draje)
Alternatively, P(t|c) can be learned for known contexts: P(noun|The, of), P(adjective|The, of), P(verb|The, of)
Observed context information:
- Lexical distribution:
  - word given context, P(w|c), e.g. P(draje|The, of)
  - context given word, P(c|w), e.g. P(The, of|draje)
  - relative frequencies over all the words in the corpus
- Morpho-lexical distribution of known tokens: P(t|w_i), e.g. P(determiner|The)…, P(preposition|of)…
Similar-words algorithm [Levinger et al. 95] [Adler 07] [Goldberg et al. 08]
Unsupervised Lexicon-based Approach

Linear-context Model
Notation: w – known word, c – context of a known word, t – tag
Initial conditions:
- raw-text corpus → p(w|c), p(c|w)
- lexicon → p(t|w)
Iterate:
- Maximization: p(t|c) = Σ_w p(t|w) p(w|c)
- Expectation: p(t|w) = Σ_c p(t|c) p(c|w)
Unsupervised Lexicon-based Approach
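One round of the two update equations can be sketched directly with nested dictionaries. This is a minimal illustration of the slide's iteration, not the authors' code; the data-structure layout (dict of dicts keyed by word/context) is my own choice.

```python
def em_step(p_t_given_w, p_w_given_c, p_c_given_w):
    """One iteration of the linear-context updates.
    All arguments are dicts of dicts, e.g. p_t_given_w[w][t] = p(t|w)."""
    # Maximization: p(t|c) = sum_w p(t|w) p(w|c)
    p_t_given_c = {}
    for c, w_dist in p_w_given_c.items():
        acc = {}
        for w, pwc in w_dist.items():
            for t, ptw in p_t_given_w.get(w, {}).items():
                acc[t] = acc.get(t, 0.0) + ptw * pwc
        p_t_given_c[c] = acc
    # Expectation: p(t|w) = sum_c p(t|c) p(c|w)
    new_p_t_given_w = {}
    for w, c_dist in p_c_given_w.items():
        acc = {}
        for c, pcw in c_dist.items():
            for t, ptc in p_t_given_c.get(c, {}).items():
                acc[t] = acc.get(t, 0.0) + ptc * pcw
        new_p_t_given_w[w] = acc
    return p_t_given_c, new_p_t_given_w
```

Starting from the lexicon-derived p(t|w) for known words and the count-based p(w|c) and p(c|w), repeating `em_step` propagates tag mass from words to their contexts and back, which is what lets an unknown word inherit a tag distribution from the contexts it appears in.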

Evaluation
Resources:
- Lexicon: MILA
- Corpus: train: unlabeled 42M-token corpus; test: annotated news articles of 90K token instances (3% unknown tokens, 2% unknown analyses)
- PN classifier
Evaluation

Evaluation - Models
- Baseline: most frequent tag – proper name – for all possible segmentations of the token
- Letters model
- Pattern model
- Letters + Pattern model
- Linear-context, Letters
- Linear-context, Pattern
- Linear-context, Letters + Pattern
Evaluation

Evaluation - Criteria
Suggested analysis set:
- Coverage of the correct analysis
- Ambiguity level (average number of analyses)
- Average probability of the correct analysis
Disambiguation accuracy:
- Number of correct analyses picked by the complete system
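The three analysis-set criteria are straightforward to compute. The sketch below is my own formulation of them (the slide does not give formulas): coverage is the fraction of tokens whose gold analysis appears in the suggested set, ambiguity is the average set size, and the third criterion averages the probability assigned to the gold analysis.

```python
def analysis_set_metrics(predictions, gold):
    """predictions: {token: {analysis: prob}}; gold: {token: gold_analysis}."""
    n = len(gold)
    coverage = sum(1 for w, g in gold.items()
                   if g in predictions.get(w, {})) / n
    ambiguity = sum(len(a) for a in predictions.values()) / max(len(predictions), 1)
    avg_prob = sum(predictions.get(w, {}).get(g, 0.0) for w, g in gold.items()) / n
    return {"coverage": coverage, "ambiguity": ambiguity, "prob": avg_prob}
```

Note the trade-off these three numbers capture together: a model can trivially raise coverage by suggesting every analysis, but only at the cost of higher ambiguity and a lower probability on the correct one.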

Evaluation – Full Morphological Analysis
Full morphological analysis set, per model (criteria: Coverage, Ambiguity, Probability of the correct analysis):
Baseline; Letters; Pattern; Letters + Pattern; Linear-context, Letters; Linear-context, Pattern; Linear-context, Letters + Pattern
Evaluation

Evaluation – Word Segmentation and POS Tagging
Segmentation + POS analysis set, per model (criteria: Coverage, Ambiguity, Probability of the correct analysis):
Baseline; Letters; Pattern; Letters + Pattern; Linear-context, Letters; Linear-context, Pattern; Linear-context, Letters + Pattern
Evaluation

Evaluation - Conclusion
- Error reduction > 30% over a competitive baseline, on a large-scale dataset of 90K tokens:
  - Full morphological disambiguation: 79% accuracy
  - Word segmentation and POS tagging: 70% accuracy
- The unsupervised linear-context model is as effective as a model which uses hand-crafted patterns
- Effective combination of textual observations from unlabeled data with a lexicon
- Effective combination of an ME model for tag distribution with an SVM model for PN classification
- Overall, error reduction of 5% for the whole disambiguation system
Evaluation

Summary
- The characteristics of known words can help resolve unknown words
- Unsupervised (unlabeled data), lexicon-based approach
- Language-independent algorithm for computing the distribution p(t|w) for unknown words
- Nature of agglutinated prefixes in Hebrew [Ben-Eliahu et al. 2008]

תנקס tnqs (thanks)
- foreign 0.4
- proper noun 0.3
- noun.plural.feminine.absolute 0.2
- verb.singular.feminine.3.future 0.08
- verb.singular.masculine.2.future 0.02