Meni Adler and Michael Elhadad Ben Gurion University COLING-ACL 2006


Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation
Meni Adler and Michael Elhadad, Ben Gurion University, COLING-ACL 2006

unsprvsd mrphm-bsd hmm fr hbrw mrflgcl dsmbgtn mni adlr and mchl elhdd bn grn unvrst clng-acl ht$s”b
(the title slide rendered without vowels, as in unvocalized Hebrew writing)

Hebrew
- Unvocalized writing: inherent → inhrnt
- Affixation: in her note → inhrnt; in her net → inhrnt
- Rich morphology: 'inhrnt' can be inflected into different forms according to singular/plural and masculine/feminine properties, and some morphological alternations (construct/absolute) leave the surface form 'inhrnt' unmodified.

Ambiguity Level
These variations create a high level of ambiguity. An English lexicon maps inherent → inherent.adj; with Hebrew word-formation rules, inhrnt →
- in.prep her.pro.fem.poss note.noun
- in.prep her.pro.fem net.noun
- inherent.adj.masc.absolute
- inherent.adj.masc.construct

Size of tagset:
- Hebrew: theoretically ~300K; in practice ~2K
- English: 45-195 tags

Number of possible morphological analyses per word instance:
- English: 1.4 (average words per sentence: 12)
- Hebrew: 2.4 (average words per sentence: 18)
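The per-word ambiguity figure can be illustrated with a toy analyzer table. The `analyses` dictionary below is a hypothetical stand-in for a real morphological analyzer, filled with the example analyses above:

```python
# Toy sketch: 'analyses' is a hypothetical stand-in for a morphological
# analyzer's output, using the example analyses of 'inhrnt' and 'txt'.
analyses = {
    "inhrnt": [
        "in.prep+her.pro.fem.poss+note.noun",
        "in.prep+her.pro.fem+net.noun",
        "inherent.adj.masc.absolute",
        "inherent.adj.masc.construct",
    ],
    "txt": ["noun.masc.sing.abs", "noun.masc.sing.cons"],
}

def avg_ambiguity(tokens):
    """Average number of morphological analyses per word instance."""
    return sum(len(analyses[t]) for t in tokens) / len(tokens)

print(avg_ambiguity(["inhrnt", "txt"]))  # 3.0
```

On this two-word toy text the average is (4 + 2) / 2 = 3.0; over a real Hebrew corpus the slides report 2.4.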

Stochastic English Taggers
- Supervised: 97%
- Semi-supervised, transformation-based: 95%
- Unsupervised HMM: 75%-86%
- Unsupervised HMM with good initial conditions: ~94% [Elworthy 95]
- Supervised HMM with only 100 tagged sentences: ~92% [Merialdo 95]

Hebrew Taggers
- [Levinger et al. 95] Context-free approximation of morpho-lexical distributions, based on similar-word sets. 88% reported, 78% tested.
- [Levinger 94] An expert system based on a manual set of 16 syntactic constraints. 94% accuracy for 85% of the words.
- [Carmel & Maarek 99] Disambiguation of lemma and part of speech: 75% of words get one analysis with 95% accuracy, 20% get 2 analyses, 5% get 3 analyses.
- [Segal 2000] Pseudo rule-based transformation method, supervised on a corpus of 5K words. 95% reported, 85% tested.
- [Bar-Haim et al. 2005] Morpheme-based HMM over segmented text, supervised on a corpus of 60K words. 90.51% on segmentation and PoS tagging.

Arabic vs. Hebrew
- Similar morphology: rich morphology, affixation, unvocalized writing
- ~2,200 tags
- An average of 2 analyses per word
- Baseline of selecting the most frequent tag for each word: Arabic ~92% (Diab 2004, Habash & Rambow 2005); Hebrew 72%
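The most-frequent-tag baseline quoted above can be sketched as follows; the toy corpus and tag names are invented for illustration:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Most-frequent-tag baseline: tag each word with the tag it carried
    most often in training (toy sketch; tag names are invented)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

corpus = [
    ("inhrnt", "adj"), ("inhrnt", "prep+pos+noun"),
    ("inhrnt", "adj"), ("txt", "noun"),
]
model = train_baseline(corpus)
print(model["inhrnt"])  # adj
```

With Hebrew's high ambiguity this baseline reaches only 72%, versus ~92% for Arabic.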

Arabic Tagging
- [Diab et al. 2004] Used a training corpus; 95.5% on PoS tagging (92% baseline).
- [Habash & Rambow 2005] Supervised morphological classifiers trained on two sets of 120K words; accuracy 94.8%-96.2%.
- [Duh & Kirchhoff 2005] Word-based HMM for PoS tagging of dialectal Arabic. Unsupervised: 69.83%; supervised: 92.53%.

Unsupervised Model
Stochastic model, unsupervised learning, exact inference.

Motivation:
- Not enough annotated data available for supervised training.
- The dynamic nature of Modern Hebrew as it evolves over time (20% new lexemes over a 10-year period).

Expectations:
- A larger training corpus helps.
- Good initial conditions help.
- A small amount of annotated data helps.

First-order word-based HMM
[Diagram: one tag state per word; for "inhrnt txt", e.g. T1 = prep+pos+noun.fem.sing.cons emits inhrnt, T2 = noun.masc.sing.abs emits txt]

First-order word-based HMM
- Number of tags: 1,934
- State transitions: ~250K
- Lexical transitions: ~300K
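A first-order HMM of this kind is decoded with the Viterbi algorithm. The sketch below is a minimal illustration with invented toy tags and probabilities, not the model's 1,934-tag inventory or its estimated parameters:

```python
import math

# Toy first-order HMM (invented numbers): trans[t1][t2] = P(t2 | t1),
# emit[t][w] = P(w | t), start[t] = P(t at sentence start).
start = {"ADJ": 0.5, "P+POS+N": 0.5}
trans = {
    "ADJ":     {"NOUN": 0.9, "ADJ": 0.1},
    "P+POS+N": {"NOUN": 0.6, "ADJ": 0.4},
    "NOUN":    {"NOUN": 0.5, "ADJ": 0.5},
}
emit = {"ADJ": {"inhrnt": 0.2}, "P+POS+N": {"inhrnt": 0.1}, "NOUN": {"txt": 0.3}}

TINY = 1e-12  # floor for unseen events, to keep log() defined

def viterbi(words):
    # V[t] = (best log-score of a path ending in tag t, that path)
    V = {t: (math.log(start.get(t, TINY))
             + math.log(emit.get(t, {}).get(words[0], TINY)), [t])
         for t in trans}
    for w in words[1:]:
        V = {t: max((V[p][0] + math.log(trans[p].get(t, TINY))
                     + math.log(emit.get(t, {}).get(w, TINY)), V[p][1] + [t])
                    for p in V)
             for t in trans}
    return max(V.values())[1]

print(viterbi(["inhrnt", "txt"]))  # ['ADJ', 'NOUN']
```

The dynamic program keeps one best path per tag at each position, so decoding is linear in sentence length for a fixed tagset.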

Partial second-order word-based HMM
- Number of tags: 1,934
- State transitions: ~7M
- Lexical transitions: ~300K

Second-order word-based HMM
- Number of tags: 1,934
- State transitions: ~7M
- Lexical transitions: ~5M

Research Hypothesis
- The large set of morphological features should be modeled in a compact morpheme model.
- Morpheme segmentation and tagging should be learned/searched in parallel (in contrast to [Bar-Haim et al. 2005]).

First-order morpheme-based HMM
[Diagram: one tag state per morpheme; for "in hr nt txt": Ti = prep emits in, Ti+1 = pos+pronoun emits hr, Ti+2 = noun.fem.sing.cons emits nt, Ti+3 = noun.masc.sing.abs emits txt]

First-order morpheme-based HMM
- Number of tags: 202
- State transitions: ~20K
- Lexical transitions: ~130K

Partial second-order morpheme-based HMM
- Number of tags: 202
- State transitions: ~700K
- Lexical transitions: ~130K

Second-order morpheme-based HMM
- Number of tags: 202
- State transitions: ~700K
- Lexical transitions: ~1.7M

Model Sizes

  Model  States  PI   A      A2     B      B2
  W      1934    834  ~250K  ~7M    ~300K  ~5M
  M      202     145  ~20K   ~700K  ~130K  ~1.7M

(PI: initial state distribution; A/A2: first-/second-order state transitions; B/B2: first-/second-order lexical transitions.)
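The compactness of the morpheme model follows directly from the smaller tag inventory. As a back-of-the-envelope check, the space of possible tag bigrams shrinks by roughly two orders of magnitude (the observed transition counts in the table are of course far smaller than these upper bounds, since only transitions seen in the corpus are stored):

```python
# Tag inventory sizes from the slides.
word_tags = 1934      # word-based model
morpheme_tags = 202   # morpheme-based model

print(word_tags ** 2)      # 3740356 possible word-tag bigrams
print(morpheme_tags ** 2)  # 40804 possible morpheme-tag bigrams
```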

Agglutination of the observed Morphemes
[Diagram: the separate observed morphemes in, hr, nt and the word txt]

Agglutination of the observed Morphemes
[Diagram: the morphemes agglutinated into the surface words inhrnt and txt]

Text Representation of "inhrnt txt"

  Word    Segmentation  Tag
  inhrnt  inhrnt        adj.masc.sing.abs
  inhrnt  inhrnt        adj.masc.sing.cons
  inhrnt  in-hr-nt      prep+pos+noun.fem.sing.cons
  inhrnt  in-hr-nt      prep+pos+noun.masc.sing.cons
  txt     txt           noun.masc.sing.abs
  txt     txt           noun.masc.sing.cons
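One way to picture this representation is as a lattice in which each (segmentation, tag) row becomes a path of edges between word boundaries. The node numbering below is illustrative, not the numbering from the slides:

```python
# Each edge is (from_node, to_node, morpheme, tag); whole-word analyses
# are single edges, segmented analyses are chains of edges.
lattice = [
    (0, 3, "inhrnt", "adj.masc.sing.abs"),
    (0, 3, "inhrnt", "adj.masc.sing.cons"),
    (0, 1, "in", "prep"),
    (1, 2, "hr", "pos+pronoun"),
    (2, 3, "nt", "noun.fem.sing.cons"),
    (3, 4, "txt", "noun.masc.sing.abs"),
    (3, 4, "txt", "noun.masc.sing.cons"),
]

def paths(node=0, end=4):
    """Enumerate all full analysis paths through the lattice."""
    if node == end:
        return [[]]
    out = []
    for (u, v, m, t) in lattice:
        if u == node:
            out += [[(m, t)] + rest for rest in paths(v, end)]
    return out

print(len(paths()))  # 6 full analyses of "inhrnt txt"
```

(This simplified sketch attaches one composite tag to the in-hr-nt chain's last edge rather than one tag per morpheme edge; the point is only that every table row corresponds to a lattice path.)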

Text Representation of "inhrnt txt"
[Lattice diagram: the analyses of "inhrnt txt" encoded as transitions between numbered nodes, with paths ending in EOS]

Multi-Word Expressions
[Lattice diagram: "arrive in time", where "in time" is either a single multi-word expression or the two tokens "in" and "time"; both paths end in EOS]

Learning and Searching
- The learning and search algorithms were adapted to support the new text representation.
- The complexity of the algorithms is O(T'), where T' is the number of transitions in the sentence under the new representation.
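A sketch of why search is O(T'): the best path through the transition lattice can be found with a single relaxation pass over its edges in topological order. The scores below are invented, and for simplicity each edge carries an independent score, whereas the actual HMM conditions each transition on the previous tag(s):

```python
import math

# Each edge is (from_node, to_node, label, log_prob) -- toy numbers only.
edges = [
    (0, 3, "inhrnt/adj", math.log(0.4)),
    (0, 1, "in/prep", math.log(0.6)),
    (1, 2, "hr/pos+pro", math.log(0.9)),
    (2, 3, "nt/noun", math.log(0.8)),
    (3, 4, "txt/noun", math.log(1.0)),
]

def best_path(n_nodes=5):
    """One relaxation per edge, i.e. O(T') where T' = number of edges.
    Node ids are already in topological order, so sorting by them works."""
    best = {0: (0.0, [])}  # node -> (best log-score, best path of labels)
    for (u, v, label, lp) in sorted(edges):
        if u in best:
            cand = (best[u][0] + lp, best[u][1] + [label])
            if v not in best or cand[0] > best[v][0]:
                best[v] = cand
    return best[n_nodes - 1][1]

print(best_path())  # ['in/prep', 'hr/pos+pro', 'nt/noun', 'txt/noun']
```

Here the segmented reading wins (0.6 * 0.9 * 0.8 = 0.432 > 0.4); each edge is touched exactly once.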

The Training Corpus
- Daily news texts
- About 6M words
- 178,580 different words
- 64,541 different lexemes
- Average number of analyses per word: 2.4
- Initial morpho-lexical probabilities according to [Levinger, Ornan, Itai 95]
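The Levinger-style initial conditions can be sketched roughly as follows. This is only our reading of the similar-words idea, not the original algorithm: each analysis of a word is scored by the corpus frequency of "similar words" (surface forms the same reading would produce under other inflections), then normalized. All forms and counts below are invented:

```python
# Hypothetical corpus frequencies of the invented similar forms.
corpus_counts = {"inhrntim": 30, "bhrnt": 15}

# Hypothetical similar-word sets per analysis of 'inhrnt'.
similar_words = {
    "inhrnt/adj": ["inhrntim"],         # invented plural of the adjective reading
    "inhrnt/prep+pos+noun": ["bhrnt"],  # invented same noun with another prefix
}

def context_free_probs(similar):
    """P(analysis | word) proportional to the counts of its similar words."""
    weights = {a: sum(corpus_counts.get(w, 0) for w in ws)
               for a, ws in similar.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

probs = context_free_probs(similar_words)
print(probs["inhrnt/adj"])  # ~0.67
```

These context-free estimates then serve as the HMM's initial lexical probabilities before unsupervised training.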

Morphological Disambiguation (accuracy, %; initial conditions: uniform vs. context-free [Levinger et al. 95])

  Model  Order  Uniform  Context-Free
  W      1      82.01    84.08
  W      2-     80.44    85.75
  W      2      79.88    85.78
  M      1      81.08    84.54
  M      2-     81.53    88.5
  M      2      83.39    85.83

Analysis
- Baseline: 78.2% ([Levinger et al. 95], similar words)
- Error reduction from contextual information: 17.5% (78.2 → 82.01)
- Error reduction from initial conditions: 11.5%-37.8% (82.01 → 84.08 for the word model, order 1; 81.53 → 88.5 for the morpheme model, order 2-)
- Model order: 2- produced the best results for both the word (85.75%) and morpheme (88.5%) models
- Model type: the morpheme model reduced about 19.3% of the errors (85.75 → 88.5)
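The error-reduction figures are relative reductions in error rate, which can be checked directly:

```python
def error_reduction(base_acc, new_acc):
    """Relative error reduction (in %) between two accuracies (in %)."""
    return 100 * (new_acc - base_acc) / (100 - base_acc)

print(round(error_reduction(85.75, 88.5), 1))   # 19.3 (morpheme vs. word model)
print(round(error_reduction(78.2, 82.01), 1))   # 17.5 (contextual information)
```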

Segmentation and PoS Tagging (accuracy, %; initial conditions: uniform vs. context-free)

  Model  Order  Uniform  Context-Free
  W      1      91.07    91.47
  W      2-     90.45    91.93
  W      2      90.21    91.84
  M      1      89.23    91.42
  M      2-     89.77    91.76
  M      2               92.32

Confusion Matrix
[Table garbled in transcript; recoverable entries: the largest error class is proper names mis-tagged as nouns (17.9% of errors), followed by a verb confusion (15.3%); smaller classes, including an adjective confusion, account for 6.6%, 6.3%, 5.4%, and 5.0% of errors]

Unknown Words
- 20% of the word types in the training corpus have no analysis proposed by the analyzer.
- 7.5% of the word instances of the test corpus (30K words) do not have a correct analysis proposed by the analyzer:
  - 4% have no analysis at all
  - 3.5% receive analyses, none of which is correct

Unknown Words Distribution (% of the unknowns)

  Category        No Analysis  Missing Correct Analysis  Total
  Proper name     26           36                        62
  Closed-set PoS  8            5.6                       13.6
  Other           16.5         5.4                       21.9
  Junk            2.5                                    2.5
  Total           53           47                        100

Dealing with Unknowns
- Lexicon modifications (closed-set corrections).
- Introduce an 'Unknown' tag, with its tag distribution estimated from the re-tagged training corpus.
- About 50% of the unknown words were resolved.
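The 'Unknown' tag distribution can be sketched as the empirical tag distribution of analyzer-rejected words in the re-tagged training corpus. The tags and counts below are invented for illustration:

```python
from collections import Counter

# Hypothetical tags assigned (by re-tagging the training corpus) to words
# the analyzer could not analyze.
retagged_unknowns = ["proper-name", "proper-name", "noun", "proper-name", "adj"]

counts = Counter(retagged_unknowns)
total = sum(counts.values())
# Empirical distribution used as the tag distribution for future unknowns.
unknown_dist = {tag: c / total for tag, c in counts.items()}

print(unknown_dist["proper-name"])  # 0.6
```

This matches the distribution table above, where proper names dominate the unknowns.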

Conclusions
- Introduced a new text representation method that efficiently encodes the ambiguities produced by complex affix-based morphology.
- Adapted the HMM to the new text representation: unsupervised learning of tagging and segmentation in one step.
- Best results on full morphological disambiguation for Hebrew (88.5%) and on PoS tagging and segmentation (92.3%).

Future Work
- Semi-supervised model (100K tagged words)
- Unknown-word morphology:
  - Neologisms
  - Proper-name recognizer
  - Unknown-word tag distribution
- Smoothing techniques (currently [Thede and Harper 99], with an extension of additive smoothing for lexical probabilities)