1
Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation
Meni Adler and Michael Elhadad, Ben Gurion University
COLING-ACL 2006
2
unsprvsd mrphm-bsd hmm frhbrw mrflgcl dsmbgtn
mni adlr andmchl elhdd bn grn unvrst clng-acl ht$s”b
3
Hebrew
Unvocalized writing: inherent → inhrnt
Affixation: in her note → inhrnt; in her net → inhrnt
Rich morphology: 'inhrnt' can be inflected into different forms according to sing/pl and masc/fem properties. Some morphological property alternations (construct/absolute) can also leave 'inherent' unmodified.
4
Ambiguity Level
These variations create a high level of ambiguity.
English lexicon: inherent → inherent.adj
With Hebrew word-formation rules, inhrnt yields:
in.prep her.pro.fem.poss note.noun
in.prep her.pro.fem net.noun
inherent.adj.masc.absolute
inherent.adj.masc.construct
Size of Tagset
Hebrew: theoretically ~300K tags; in practice ~2K
English: far fewer tags
Number of possible morphological analyses per word instance:
English: 1.4 (average # words per sentence: 12)
Hebrew: 2.4 (average # words per sentence: 18)
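A minimal sketch (mine, not from the paper) of how this ambiguity level can be computed once an analyzer has produced the alternatives; the analysis strings for inhrnt are the ones listed on this slide, and the two analyses of txt come from the later "Text Representation" slide. The dictionary encoding is an assumption.

```python
# Analyses per surface form, as listed on the slides (encoding is illustrative).
analyses = {
    "inhrnt": [
        "in.prep her.pro.fem.poss note.noun",
        "in.prep her.pro.fem net.noun",
        "inherent.adj.masc.absolute",
        "inherent.adj.masc.construct",
    ],
    "txt": ["noun.masc.sing.abs", "noun.masc.sing.cons"],
}

def average_ambiguity(tokens):
    """Average number of morphological analyses per word instance."""
    return sum(len(analyses.get(t, [])) for t in tokens) / len(tokens)

print(average_ambiguity(["inhrnt", "txt"]))  # 3.0 for this toy pair of tokens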
5
Stochastic English Taggers
Supervised: 97%
Semi-supervised, transformation-based: 95%
Unsupervised HMM: 75%-86%
Unsupervised HMM with good initial conditions: ~94% [Elworthy 95]
Supervised HMM with only 100 tagged sentences: ~92% [Merialdo 95]
6
Hebrew Taggers
[Levinger et al. 95] Context-free approximation of morpho-lexical distributions, based on a similar-words set. 88% reported, 78% tested.
[Levinger 94] An expert system, based on a manual set of 16 syntactic constraints. 94% accuracy for 85% of the words.
[Carmel & Maarek 99] Disambiguation of lemma and part of speech: 75% of the words get one analysis with 95% accuracy, 20% get 2 analyses, 5% get 3 analyses.
[Segal 2000] Pseudo rule-based transformation method, supervised on a corpus of 5K words. 95% reported, 85% tested.
[Bar-Haim et al. 2005] Morpheme-based HMM over segmented text, supervised on a corpus of 60K words, for segmentation and PoS tagging.
7
Arabic vs. Hebrew: Similar Morphology
Rich morphology, affixation, unvocalized writing
2,200 tags
An average of 2 analyses per word
Selecting the most frequent tag for a word:
Arabic: ~92% (Diab 2004, Habash & Rambow 2005)
Hebrew: 72%
8
Arabic Tagging
[Diab et al. 2004] Used a training corpus for PoS tagging; 92% baseline.
[Habash, Rambow 2005] Supervised morphological classifiers based on two sets of 120K words; accuracy: 94.8%.
[Duh, Kirchhoff 2005] Word-based HMM for PoS tagging of dialectal Arabic. Unsupervised: 69.83%, supervised: 92.53%.
9
Unsupervised Model
Stochastic model, unsupervised learning, exact inference.
Motivation:
Not enough data available for supervised training.
Dynamic nature of Modern Hebrew, as it evolves over time (20% new lexemes in a 10-year period).
Expectations:
A larger training corpus helps.
Good initial conditions help.
A small amount of annotated data helps.
10
First-order word-based HMM
(Diagram: hidden states Ti, Ti+1 emit the words inhrnt and txt; example tag values: T1 = prep + pos + noun.fem.sing.cons, T2 = noun.masc.sing.abs.)
11
First-order word-based HMM
(Diagram: hidden states Ti, Ti+1 emit the words inhrnt and txt.)
Number of tags: 1934
State transitions: ~250K
Lexical transitions: ~300K
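For reference, the standard first-order word-based HMM factorization that such a diagram depicts (notation mine, with $t_i$ the hidden tag and $w_i$ the observed word):

$$P(t_1 \dots t_n, w_1 \dots w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i), \qquad t_0 = \text{sentence start}$$

The ~250K state transitions and ~300K lexical transitions counted above presumably correspond to the entries of the $P(t_i \mid t_{i-1})$ and $P(w_i \mid t_i)$ tables.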
12
Partial second-order word-based HMM
(Diagram: hidden states Ti-1, Ti, Ti+1 emit the words inhrnt and txt.)
Number of tags: 1934
State transitions: ~7M
Lexical transitions: ~300K
13
Second-order word-based HMM
(Diagram: hidden states Ti-1, Ti, Ti+1 emit the words inhrnt and txt.)
Number of tags: 1934
State transitions: ~7M
Lexical transitions: ~5M
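One reading of "partial" vs. full second order that is consistent with the parameter counts on these slides (the slides themselves do not spell it out): the partial model uses second-order state transitions with first-order emissions, while the full model also conditions the emissions on the previous tag:

$$\text{partial: } \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i) \qquad\qquad \text{full: } \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})\, P(w_i \mid t_i, t_{i-1})$$

This would explain why the lexical-transition count grows from ~300K to ~5M only in the full second-order model.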
14
Research Hypothesis
The large set of morphological features should be modeled in a compact morpheme model.
Morpheme segmentation and tagging should be learned/searched in parallel (in contrast to [Bar-Haim et al. 2005]).
15
First-order morpheme-based HMM
(Diagram: morpheme-level hidden states Ti, Ti+1, Ti+2, Ti+3 emit the morphemes in, hr, nt and the word txt; example tag values: Ti = prep, Ti+1 = pos (possessive pronoun), Ti+2 = noun.fem.sing.cons, Ti+3 = noun.masc.sing.abs.)
16
First-order morpheme-based HMM
(Diagram: morpheme-level hidden states Ti ... Ti+3 emit in, hr, nt, txt.)
Number of tags: 202
State transitions: ~20K
Lexical transitions: ~130K
17
Partial second-order morpheme-based HMM
(Diagram: morpheme-level hidden states Ti ... Ti+3 emit in, hr, nt, txt.)
Number of tags: 202
State transitions: ~700K
Lexical transitions: ~130K
18
Second-order morpheme-based HMM
(Diagram: morpheme-level hidden states Ti ... Ti+3 emit in, hr, nt, txt.)
Number of tags: 202
State transitions: ~700K
Lexical transitions: ~1.7M
19
Model Sizes (PI = initial state distribution, A / A2 = first- / second-order state transitions, B / B2 = first- / second-order lexical transitions):
W (word-based):     States 1934, PI 834, A ~250K, A2 ~7M,   B ~300K, B2 ~5M
M (morpheme-based): States 202,  PI 145, A ~20K,  A2 ~700K, B ~130K, B2 ~1.7M
20
Agglutination of the observed Morphemes
(Diagram: the observed morphemes as separate symbols: in, hr, nt, txt.)
21
Agglutination of the observed Morphemes
(Diagram: the observed morphemes agglutinated into the surface words inhrnt and txt.)
22
Text Representation of inhrnt txt
Word     Segmentation   Tag
inhrnt   inhrnt         adj.masc.sing.abs
inhrnt   inhrnt         adj.masc.sing.cons
inhrnt   in-hr-nt       prep+pos+noun.fem.sing.cons
inhrnt   in-hr-nt       prep+pos+noun.masc.sing.cons
txt      txt            noun.masc.sing.abs
txt      txt            noun.masc.sing.cons
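A minimal Python sketch (mine, not the authors' data format) of this text representation: each word maps to its alternative (segmentation, tags) analyses, which is exactly the ambiguity the morpheme-based model has to resolve.

```python
# Alternative analyses of each word: (morpheme segmentation, per-morpheme tags).
# The dictionary layout is illustrative; the entries are the ones in the table above.
ANALYSES = {
    "inhrnt": [
        (["inhrnt"], ["adj.masc.sing.abs"]),
        (["inhrnt"], ["adj.masc.sing.cons"]),
        (["in", "hr", "nt"], ["prep", "pos", "noun.fem.sing.cons"]),
        (["in", "hr", "nt"], ["prep", "pos", "noun.masc.sing.cons"]),
    ],
    "txt": [
        (["txt"], ["noun.masc.sing.abs"]),
        (["txt"], ["noun.masc.sing.cons"]),
    ],
}
```

Because segmentation and tagging are chosen jointly, the model scores whole (segmentation, tags) paths rather than committing to a segmentation first.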
23
Text Representation of inhrnt txt
(Lattice diagram: the analyses above encoded as alternative paths of morpheme transitions between numbered states, each path ending in an EOS state.)
24
Multi-Word Expressions
(Lattice diagram: 'arrive in time' encoded with alternative paths, one taking 'in time' as a single multi-word expression and one with 'in' and 'time' as separate tokens, each path ending in an EOS state.)
25
Learning and Searching
The learning and searching algorithms were adapted to support the new text representation. The complexity of the algorithms is O(T'), where T' is the number of transitions in the sentence under the new representation.
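The adapted algorithms themselves are not shown on the slides; below is a minimal, illustrative sketch (all names, node numbers, and probability floors are mine) of a Viterbi search over such a morpheme lattice, where the work grows with the number of transitions T' rather than with the token count alone.

```python
import math
from collections import defaultdict

# A toy lattice for "inhrnt txt"; node numbers are illustrative, not the slide's.
# Each edge: (from_node, to_node, morpheme, tag), nodes numbered in topological order.
EDGES = [
    (0, 3, "inhrnt", "adj.masc.sing.abs"),
    (0, 1, "in", "prep"),
    (1, 2, "hr", "pos"),
    (2, 3, "nt", "noun.fem.sing.cons"),
    (3, 4, "txt", "noun.masc.sing.abs"),
    (3, 4, "txt", "noun.masc.sing.cons"),
]

def viterbi_lattice(edges, log_trans, log_emit, start=0, end=4):
    """Best-scoring (morpheme, tag) path through the ambiguity lattice.

    log_trans[(prev_tag, tag)] and log_emit[(tag, morpheme)] hold log-probabilities;
    unseen events get a small floor. The DP state is (node, tag of the incoming
    edge), so the work grows with the number of transitions T' in the lattice.
    """
    floor = math.log(1e-6)
    outgoing = defaultdict(list)
    nodes = set()
    for u, v, m, t in edges:
        outgoing[u].append((v, m, t))
        nodes.update((u, v))

    best = {(start, "<s>"): (0.0, [])}   # (node, last_tag) -> (log-score, path)
    for node in sorted(nodes):
        for (n, prev_tag), (score, path) in list(best.items()):
            if n != node:
                continue
            for v, m, t in outgoing[node]:
                s = score + log_trans.get((prev_tag, t), floor) + log_emit.get((t, m), floor)
                if (v, t) not in best or s > best[(v, t)][0]:
                    best[(v, t)] = (s, path + [(m, t)])

    return max((s, p) for (n, _), (s, p) in best.items() if n == end)

# With empty parameter tables every transition gets the floor score; real use would
# plug in the probabilities estimated by EM.
print(viterbi_lattice(EDGES, log_trans={}, log_emit={}))
```

Baum-Welch style learning can be adapted in the same spirit, accumulating expected counts over lattice transitions instead of over a fixed token sequence.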
26
The Training Corpus
Daily news
About 6M words
178,580 different words
64,541 different lexemes
Average number of analyses per word: 2.4
Initial morpho-lexical probabilities according to [Levinger, Ornan, Ittai 95]
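The slides say only that the initial morpho-lexical probabilities come from [Levinger, Ornan, Ittai 95]. As a hedged sketch of how such context-free, per-word estimates could seed the HMM's emission table (the similar-words estimation itself is not reproduced, and the function below is mine, not the authors' code):

```python
from collections import defaultdict

def initial_emissions(word_counts, analysis_dist):
    """Turn context-free estimates p(tag | word) into initial emission probabilities
    P(word | tag) by weighting with word frequencies and renormalizing per tag.

    word_counts:   {word: count in the raw corpus}
    analysis_dist: {word: {tag: p(tag | word)}}  # e.g. Levinger-style estimates
    (both inputs are assumed to be precomputed; this is not the authors' code).
    """
    joint = defaultdict(float)      # (tag, word) -> expected count
    tag_mass = defaultdict(float)   # tag -> total expected count
    for word, count in word_counts.items():
        for tag, p in analysis_dist.get(word, {}).items():
            joint[(tag, word)] += count * p
            tag_mass[tag] += count * p
    return {(tag, word): c / tag_mass[tag] for (tag, word), c in joint.items()}
```

EM training would then start from these emission parameters rather than from a uniform table, which is the "good initial conditions" effect measured in the results below.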
27
Morphological Disambiguation
Accuracy (%) by model type, model order, and initial conditions:
Model  Order  Uniform  Context-Free
W      1      82.01    84.08
W      2-     80.44    85.75
W      2      79.88    85.78
M      1      81.08    84.54
M      2-     81.53    88.5
M      2      83.39    85.83
28
Analysis
Baseline: 78.2% (Levinger et al., similar-words method)
Error reduction:
Contextual information: 17.5% (78.2 → 82.01)
Initial conditions: 11.5% or more (82.01 → 84.08 for the word model of order 1, and up to 88.5 for the morpheme model of order 2-)
Model order: 2- produced the best results for both the word (85.75%) and morpheme (88.5%) models
Model type: the morpheme model reduced about 19.3% of the errors (85.75 → 88.5)
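For clarity, the error-reduction figures above follow the usual relative error reduction formula (my gloss, not spelled out on the slide):

$$\text{reduction} = \frac{(100 - a_{\text{old}}) - (100 - a_{\text{new}})}{100 - a_{\text{old}}}, \qquad \text{e.g. } \frac{(100 - 85.75) - (100 - 88.5)}{100 - 85.75} = \frac{2.75}{14.25} \approx 19.3\%.$$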
29
Segmentation and PoS Tagging
Accuracy (%) by model type, model order, and initial conditions:
Model  Order  Uniform  Context-Free
W      1      91.07    91.47
W      2-     90.45    91.93
W      2      90.21    91.84
M      1      89.23    91.42
M      2-     89.77    91.76
M      2      -        92.32
30
Confusion Matrix (top error types, as % of the errors): the largest confusions are between noun and proper name (17.9% and 15.3% of the errors); confusions involving verb and adjective account for roughly 5-7% each (6.6, 6.3, 5.4, 5.0).
31
Unknown Words
20% of the word types in the training corpus have no analysis found by the analyzer.
7.5% of the word instances in the test corpus (30K words) do not have a correct analysis proposed by the analyzer:
4% have no analysis at all
3.5% have analyses, but not the correct one
32
Unknown Words Distribution
Category         % of the unknowns   % missing correct analysis   % no analysis
Proper name      62                  36                           26
Closed-set PoS   13.6                5.6                          8
Other            21.9                5.4                          16.5
Junk             2.5                 0                            2.5
Total            100                 47                           53
33
Dealing with Unknowns
Lexicon modifications (closed-set corrections).
Introduce an 'Unknown' tag; its tag distribution is estimated from the re-tagged training corpus (see the sketch below).
About 50% of the unknown words were resolved.
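How the 'Unknown' distribution is estimated is not detailed on the slide; a minimal sketch of one plausible reading, assuming the re-tagged training corpus is available as (word, tag) pairs and that unknowns are the tokens missing from the analyzer's lexicon (names and inputs are illustrative):

```python
from collections import Counter

def unknown_tag_distribution(retagged_corpus, lexicon):
    """Estimate P(tag | unknown word) from tokens the analyzer's lexicon does not cover.

    retagged_corpus: iterable of (word, tag) pairs produced by re-tagging the raw corpus
    lexicon: set of words the morphological analyzer knows
    (function name and inputs are illustrative, not taken from the paper).
    """
    counts = Counter(tag for word, tag in retagged_corpus if word not in lexicon)
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()}
```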
34
Conclusions
Introduced a new text representation method to efficiently encode the ambiguities produced by complex affix-based morphology.
Adapted the HMM to the new text representation: unsupervised learning of tagging and segmentation in one step.
Best results on full morphological disambiguation for Hebrew (88.5%) and on PoS tagging and segmentation (92.3%).
35
Future Work
Semi-supervised model (100K tagged words).
Unknown-word morphology: neologisms, a proper-name recognizer, unknown-word tag distributions.
Smoothing technique (currently [Thede and Harper 99], with an extension of additive smoothing for lexical probabilities); a generic form of such smoothing is sketched below.
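For reference, generic additive (add-lambda) smoothing of the lexical probabilities has the form below; whether this matches the authors' extension exactly is not stated on the slides:

$$P(w \mid t) = \frac{\mathrm{count}(w, t) + \lambda}{\mathrm{count}(t) + \lambda\,|V|}$$

with $|V|$ the vocabulary size and $\lambda > 0$ a small constant.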