Meni Adler and Michael Elhadad Ben Gurion University COLING-ACL 2006


1 Unsupervised Morpheme-Based HMM for Hebrew Morphological Disambiguation
Meni Adler and Michael Elhadad, Ben Gurion University, COLING-ACL 2006

2 unsprvsd mrphm-bsd hmm fr hbrw mrflgcl dsmbgtn
mni adlr and mchl elhdd, bn grn unvrst, clng-acl ht$s”b
(The title slide written without vowels, as in unvocalized Hebrew script.)

3 Hebrew
Unvocalized writing: inherent → inhrnt
Affixation: in her note → inhrnt; in her net → inhrnt
Rich morphology: 'inhrnt' can be inflected into different forms according to singular/plural and masculine/feminine properties, and some morphological alternations (construct/absolute state) leave 'inhrnt' unmodified.
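The affixation ambiguity above can be sketched in code. This is a toy illustration only: the transliterated forms come from the slide, but the affix inventories and the `segmentations` helper are hypothetical, not the paper's analyzer.

```python
# Hypothetical one-entry affix inventories, for illustration only.
PREFIXES = {"in"}          # e.g. the preposition 'in'
SUFFIX_PRONOUNS = {"hr"}   # e.g. the pronominal morpheme 'her'

def segmentations(word):
    """Enumerate candidate prefix+pronoun+stem splits of a letter
    string, plus the unsegmented word itself."""
    out = [(word,)]
    for p in PREFIXES:
        if word.startswith(p):
            rest = word[len(p):]
            for s in SUFFIX_PRONOUNS:
                if rest.startswith(s):
                    out.append((p, s, rest[len(s):]))
    return out

print(segmentations("inhrnt"))  # [('inhrnt',), ('in', 'hr', 'nt')]
```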

4 Ambiguity Level
These variations create a high level of ambiguity.
English lexicon: inherent → inherent.adj
With Hebrew word-formation rules, inhrnt →
  in.prep her.pro.fem.poss note.noun
  in.prep her.pro.fem net.noun
  inherent.adj.masc.absolute
  inherent.adj.masc.construct
Size of tagset:
  Hebrew: theoretically ~300K; in practice ~2K
  English: tens of tags (e.g., 45 in the Penn Treebank tagset)
Number of possible morphological analyses per word instance:
  English: 1.4 (average words per sentence: 12)
  Hebrew: 2.4 (average words per sentence: 18)
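The ambiguity level is just the average number of analyses per word. A toy sketch using the slide's transliterated example; the analysis table here is illustrative, not the analyzer's actual output:

```python
# Illustrative analyses of the slide's example, as (segmentation, tag) pairs.
ANALYSES = {
    "inhrnt": [
        ("in-hr-nt", "prep+pro.fem.poss+noun"),  # "in her note"
        ("in-hr-nt", "prep+pro.fem+noun"),       # "in her net"
        ("inhrnt",   "adj.masc.absolute"),       # "inherent"
        ("inhrnt",   "adj.masc.construct"),
    ],
    "txt": [
        ("txt", "noun.masc.sing.abs"),
        ("txt", "noun.masc.sing.cons"),
    ],
}

def ambiguity_level(sentence):
    """Average number of morphological analyses per word."""
    return sum(len(ANALYSES[w]) for w in sentence) / len(sentence)

print(ambiguity_level(["inhrnt", "txt"]))  # 3.0 for this toy sentence
```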

5 Stochastic English Taggers
Supervised: 97%
Semi-supervised, transformation-based: 95%
Unsupervised HMM: 75%-86%
Unsupervised HMM with good initial conditions: ~94% [Elworthy 95]
Supervised HMM with only 100 tagged sentences: ~92% [Merialdo 95]

6 Hebrew Taggers
[Levinger et al. 95] Context-free approximation of morpho-lexical distributions, based on a similar-words set. 88% reported; 78% tested.
[Levinger 94] An expert system based on a manual set of 16 syntactic constraints. 94% accuracy for 85% of the words.
[Carmel & Maarek 99] Disambiguation of lemma and part of speech: 75% of words get one analysis with 95% accuracy, 20% get 2 analyses, 5% get 3 analyses.
[Segal 2000] Pseudo rule-based transformation method, supervised on a corpus of 5K words. 95% reported; 85% tested.
[Bar-Haim et al. 2005] Morpheme-based HMM over segmented text, supervised on a corpus of 60K words, for segmentation and PoS tagging.

7 Arabic vs. Hebrew: Similar Morphology
Rich morphology, affixation, unvocalized writing
~2,200 tags
Average of 2 analyses per word
Selection of the most frequent tag for a word:
  Arabic: ~92% (Diab 2004; Habash & Rambow 2005)
  Hebrew: 72%
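The most-frequent-tag baseline above is straightforward to sketch; the toy corpus and tag names here are hypothetical:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Most-frequent-tag baseline: for each word, remember the tag it
    was seen with most often in the training corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

# Toy tagged corpus (tags illustrative).
corpus = [("inhrnt", "adj"), ("inhrnt", "prep+pro+noun"),
          ("inhrnt", "adj"), ("txt", "noun")]
baseline = train_baseline(corpus)
print(baseline["inhrnt"])  # adj (seen twice vs. once)
```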

8 Arabic Tagging
[Diab et al. 2004] Supervised tagger trained on an annotated corpus; PoS tagging accuracy above a 92% baseline.
[Habash & Rambow 2005] Supervised morphological classifiers based on two sets of 120K words; accuracy: 94.8% and above.
[Duh & Kirchhoff 2005] Word-based HMM for PoS tagging of dialectal Arabic. Unsupervised: 69.83%; supervised: 92.53%.

9 Unsupervised Model
Stochastic model, unsupervised learning, exact inference.
Motivation:
  Not enough available data for supervised training.
  Dynamic nature of Modern Hebrew as it evolves over time (20% new lexemes in a 10-year period).
Expectations:
  A larger training corpus helps.
  Good initial conditions help.
  A small amount of annotated data helps.

10 First-order word-based HMM
[Diagram: tag states Ti → Ti+1 emitting the words 'inhrnt' and 'txt'; e.g. T1 = prep+pos+noun.fem.sing.cons, T2 = noun.masc.sing.abs]

11 First-order word-based HMM
[Diagram: tag states Ti → Ti+1 emitting 'inhrnt' and 'txt']
Number of tags: 1,934
State transitions: ~250K
Lexical transitions: ~300K

12 Partial second-order word-based HMM
[Diagram: tag states Ti-1, Ti, Ti+1 emitting 'inhrnt' and 'txt']
Number of tags: 1,934
State transitions: ~7M
Lexical transitions: ~300K

13 Second-order word-based HMM
[Diagram: tag states Ti-1, Ti, Ti+1 emitting 'inhrnt' and 'txt']
Number of tags: 1,934
State transitions: ~7M
Lexical transitions: ~5M

14 Research Hypothesis
The large set of morphological features should be modeled in a compact morpheme model. Morpheme segmentation and tagging should be learned/searched in parallel (in contrast to [Bar-Haim et al. 2005]).

15 First-order morpheme-based HMM
[Diagram: tag states Ti…Ti+3 emitting the morphemes 'in', 'hr', 'nt', 'txt'; Ti = prep, Ti+1 = pos pronoun, Ti+2 = noun.fem.sing.cons, Ti+3 = noun.masc.sing.abs]

16 First-order morpheme-based HMM
[Diagram: tag states Ti…Ti+3 emitting 'in', 'hr', 'nt', 'txt']
Number of tags: 202
State transitions: ~20K
Lexical transitions: ~130K
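For reference, this is what textbook first-order Viterbi decoding looks like over an already-segmented morpheme sequence. It is a simplification: the paper's model searches segmentation and tagging jointly, and the toy tags and deterministic probabilities below are illustrative assumptions.

```python
import math

def viterbi(morphemes, tags, log_trans, log_emit, log_init):
    """Textbook first-order Viterbi over a fixed morpheme sequence.
    log_trans[(t_prev, t)] and log_emit[(t, m)] are log-probabilities;
    missing entries default to -inf (probability zero)."""
    V = [{t: log_init.get(t, -math.inf)
             + log_emit.get((t, morphemes[0]), -math.inf) for t in tags}]
    back = []
    for m in morphemes[1:]:
        scores, ptr = {}, {}
        for t in tags:
            prev = max(tags,
                       key=lambda p: V[-1][p] + log_trans.get((p, t), -math.inf))
            scores[t] = (V[-1][prev] + log_trans.get((prev, t), -math.inf)
                         + log_emit.get((t, m), -math.inf))
            ptr[t] = prev
        V.append(scores)
        back.append(ptr)
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy model with deterministic probabilities (log 1.0 = 0.0).
tags = ["prep", "pro", "noun"]
log_init = {"prep": 0.0}
log_trans = {("prep", "pro"): 0.0, ("pro", "noun"): 0.0}
log_emit = {("prep", "in"): 0.0, ("pro", "hr"): 0.0, ("noun", "nt"): 0.0}
print(viterbi(["in", "hr", "nt"], tags, log_trans, log_emit, log_init))
# ['prep', 'pro', 'noun']
```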

17 Partial second-order morpheme-based HMM
[Diagram: tag states Ti…Ti+3 emitting 'in', 'hr', 'nt', 'txt']
Number of tags: 202
State transitions: ~700K
Lexical transitions: ~130K

18 Second-order morpheme-based HMM
[Diagram: tag states Ti…Ti+3 emitting 'in', 'hr', 'nt', 'txt']
Number of tags: 202
State transitions: ~700K
Lexical transitions: ~1.7M

19 Model Sizes
Model  States  PI   A     A2    B     B2
W      1934    834  250K  7M    300K  5M
M      202     145  20K   700K  130K  1.7M
(PI: initial state distribution; A/A2: first-/second-order state transitions; B/B2: first-/second-order lexical transitions. The A and A2 entries for M are the ~20K and ~700K figures from the morpheme-model slides.)

20 Agglutination of the Observed Morphemes
[Diagram: the separate morphemes 'in', 'hr', 'nt' and the word 'txt']

21 Agglutination of the Observed Morphemes
[Diagram: the agglutinated words 'inhrnt' and 'txt']

22 Text Representation of 'inhrnt txt'
Word    Segmentation  Tag
inhrnt  inhrnt        adj.masc.sing.abs
        inhrnt        adj.masc.sing.cons
        in-hr-nt      prep+pos+noun.fem.sing.cons
        in-hr-nt      prep+pos+noun.masc.sing.cons
txt     txt           noun.masc.sing.abs
        txt           noun.masc.sing.cons

23 Text Representation of 'inhrnt txt'
[Diagram: sentence lattice with numbered nodes; morpheme arcs such as 'nt' lead into end-of-sentence (EOS) nodes]
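The lattice idea behind this text representation can be sketched as arcs between numbered nodes: each analysis of a word contributes a path of morpheme arcs between shared start and end nodes, so segmentation ambiguity is encoded once. The node numbers and the enumeration helper below are illustrative, not the paper's data structure.

```python
# Toy sentence lattice for 'inhrnt txt'; arcs are (from_node, to_node, morpheme).
lattice = [
    (0, 3, "inhrnt"),   # unsegmented analysis
    (0, 1, "in"),       # segmented analysis: in + hr + nt
    (1, 2, "hr"),
    (2, 3, "nt"),
    (3, 4, "txt"),
]

def paths(lattice, start, end, prefix=()):
    """Enumerate all morpheme sequences from start to end."""
    if start == end:
        yield prefix
    for s, t, m in lattice:
        if s == start:
            yield from paths(lattice, t, end, prefix + (m,))

print(list(paths(lattice, 0, 4)))
# [('inhrnt', 'txt'), ('in', 'hr', 'nt', 'txt')]
```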

24 Multi-Word Expressions
[Diagram: lattice for 'arrive in time' in which 'in time' is a single multi-word arc alongside the separate 'in' and 'time' arcs, each path ending in an EOS node]

25 Learning and Searching
The learning and searching algorithms were adapted to support the new text representation. Their complexity is O(T'), where T' is the number of transitions in the sentence lattice under the new representation.
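A minimal sketch of a forward pass adapted to such a lattice, showing why the cost scales with the number of transitions T': every arc is visited once. The tags, probabilities, and node numbering below are toy assumptions, not the paper's implementation.

```python
import math
from collections import defaultdict

NEG = -math.inf

def logsumexp(xs):
    m = max(xs)
    if m == NEG:
        return NEG
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward(arcs, n_nodes, tags, log_trans, log_emit):
    """Forward pass over a sentence lattice: alpha[v][t] is the log-prob
    of reaching node v with the last morpheme tagged t. Each arc is
    visited once, so the cost is O(T' * |tags|^2) for T' arcs."""
    alpha = [defaultdict(lambda: NEG) for _ in range(n_nodes)]
    alpha[0]["<s>"] = 0.0  # start-of-sentence state
    for u, v, m in sorted(arcs):  # nodes are numbered in topological order
        if not alpha[u]:
            continue
        for t in tags:
            score = logsumexp([alpha[u][p] + log_trans.get((p, t), NEG)
                               for p in list(alpha[u])])
            score += log_emit.get((t, m), NEG)
            alpha[v][t] = logsumexp([alpha[v][t], score])
    return alpha

# Toy lattice for 'inhrnt': one arc tagged adj, or 'in' + 'hrnt'.
arcs = [(0, 2, "inhrnt"), (0, 1, "in"), (1, 2, "hrnt")]
tags = ["adj", "prep", "noun"]
log_trans = {("<s>", "adj"): math.log(0.5), ("<s>", "prep"): math.log(0.5),
             ("prep", "noun"): 0.0}
log_emit = {("adj", "inhrnt"): 0.0, ("prep", "in"): 0.0, ("noun", "hrnt"): 0.0}

alpha = forward(arcs, 3, tags, log_trans, log_emit)
total = logsumexp(list(alpha[2].values()))
print(round(math.exp(total), 3))  # 1.0: the two analyses' probabilities sum to one
```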

26 The Training Corpus
Daily news, about 6M words
178,580 different words; 64,541 different lexemes
Average number of analyses per word: 2.4
Initial morpho-lexical probabilities according to [Levinger, Ornan, Ittai 95]

27 Morphological Disambiguation
Model  Order  Uniform  Context-Free
W      1      82.01    84.08
W      2-     80.44    85.75
W      2      79.88    85.78
M      1      81.08    84.54
M      2-     81.53    88.5
M      2      83.39    85.83
(Accuracy in %, by initial conditions: uniform vs. context-free estimates.)

28 Analysis
Baseline: 78.2% ([Levinger et al. 95], similar-words method)
Error reduction:
  Contextual information: 17.5% (78.2 → 82.01).
  Initial conditions: 11.5% and more (82.01 → 84.08 for the W model, order 1; up to 88.5 for the M model, order 2-).
  Model order: 2- produced the best results for both word (85.75%) and morpheme (88.5%) models.
  Model type: the morpheme model reduced about 19.3% of the errors (85.75 → 88.5).
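These figures follow the standard relative-error-reduction formula, which can be checked directly against the numbers on the slide:

```python
def error_reduction(before, after):
    """Relative error reduction between two accuracies, in percent:
    how much of the remaining error (100 - before) was eliminated."""
    return 100 * (after - before) / (100 - before)

# Figures from the slides:
print(round(error_reduction(78.2, 82.01), 1))   # 17.5 (contextual information)
print(round(error_reduction(85.75, 88.5), 1))   # 19.3 (morpheme vs. word model)
```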

29 Segmentation and PoS Tagging
Model  Order  Uniform  Context-Free
W      1      91.07    91.47
W      2-     90.45    91.93
W      2      90.21    91.84
M      1      89.23    91.42
M      2-     89.77    91.76
M      2      —        92.32
(Accuracy in %, by initial conditions: uniform vs. context-free estimates.)

30 Confusion Matrix
[Table: correct vs. predicted tags as % of errors; the largest confusion, 17.9%, is proper names tagged as nouns; the remaining rows (15.3, 6.6, 6.3, 5.4, 5.0) involve verbs and adjectives]

31 Unknown Words
20% of the word types in the training corpus have no analysis found by the analyzer.
7.5% of the word instances in the test corpus (30K words) do not have a correct analysis proposed by the analyzer:
  4% have no analysis at all
  3.5% do not include the correct analysis among those proposed

32 Unknown Words Distribution
Category        % of unknowns  % missing correct analysis  % no analysis
Proper name     62             36                          26
Closed-set PoS  13.6           5.6                         8
Other           21.9           5.4                         16.5
Junk            2.5            0                           2.5
Total           100            47                          53

33 Dealing with Unknowns
Lexicon modifications (closed-set corrections).
Introduce an 'Unknown' tag, with its distribution estimated from the re-tagged training corpus.
About 50% of the unknown words were resolved.
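Estimating the 'Unknown' tag distribution from a re-tagged corpus can be sketched as follows; the toy corpus, lexicon, and tag names are hypothetical:

```python
from collections import Counter

def unknown_tag_distribution(retagged_corpus, lexicon):
    """Estimate P(tag | unknown word) from the words the analyzer
    could not analyze in the (automatically re-tagged) training corpus."""
    counts = Counter(tag for word, tag in retagged_corpus
                     if word not in lexicon)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Toy re-tagged corpus: 'txt' is in the lexicon, the rest are unknowns.
corpus = [("adlr", "properName"), ("grn", "properName"),
          ("txt", "noun"), ("xyzq", "noun")]
dist = unknown_tag_distribution(corpus, lexicon={"txt"})
print(dist)  # roughly 2/3 properName, 1/3 noun
```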

34 Conclusions
Introduced a new text-representation method that efficiently encodes the ambiguities produced by complex affix-based morphology.
Adapted the HMM to the new text representation: unsupervised learning of tagging and segmentation in one step.
Best results on full morphological disambiguation for Hebrew (88.5%) and on PoS tagging and segmentation (92.3%).

35 Future Work
Semi-supervised model (100K tagged words)
Unknown-word morphology:
  Neologisms
  Proper-name recognizer
  Unknown-word tag distribution
Smoothing technique (currently [Thede and Harper 99], with an extension of additive smoothing for lexical probabilities)

