1
Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis
Meni Adler, Yoav Goldberg, David Gabay, Michael Elhadad
Ben-Gurion University
ACL 2008, Columbus, Ohio
2
Unknown Words - English
"The draje of the tagement starts rikking with Befa."
Morphology: unknown word → analysis 1 (prob 1), …, analysis i (prob i), …, analysis n (prob n) → Syntax
3
Unknown Words - English
"The draje of the tagement starts rikking with Befa."
Morphology handles the unknown words: tagement, rikking, Befa
Syntax handles the full sentence: The draje of the tagement starts rikking with Befa.
4
Unknowns Resolution Method in English
Baseline method:
- PN tag for capitalized words
- Uniform distribution over the open-class tags otherwise
Evaluation:
- 12 open-class tags
- 45% of the unknowns are capitalized
- Overall, 70% of the unknowns were tagged correctly
5
Unknowns Resolution Method in Hebrew
The baseline method resolves only 5% of the Hebrew unknown tokens!
Why? How can we improve?
6
Unknown Words - Hebrew
"The draje of the tagement starts rikking with Befa."
עם בפה דרג הטגמנט התחיל לנפן
drj htgmnt hthil lnpn `m bfh
7
Unknown Words - Hebrew
drj htgmnt hthil lnpn `m bfh
Morphology:
- No capitalization: PN is always a candidate
- Many open-class tags (> 3,000)
Syntax:
- Unmarked, agglutinated function words: preposition, definite article, conjunction, relative marker; e.g., "the drj" and "of drj" are each written as a single token
- Function-word ambiguity:
  htgmnt: VBP/VBI, DEF+MM
  `m: PREP (with), NN (people)
  bfh: PREP+NN/PN/JJ/VB, PREP+DEF+NN/JJ…
8
Outline
- Characteristics of Hebrew Unknowns
- Previous Work
- Unsupervised Lexicon-based Approaches
  - Letters Model
  - Pattern Model
  - Linear-context Model
- Evaluation
- Conclusion
9
Hebrew Text Analysis System
Pipeline: text → Tokenizer → tokenized text → Morphological Analyzer (lexicon; the Unknown Tokens Analyzer of this work supplies analysis distributions for out-of-lexicon words) → analysis distributions for known and unknown words → Morphological Disambiguator (HMM) → disambiguated text → Proper-name Classifier (SVM) → disambiguated text with PN
Further components: Named-Entity Recognizer, Noun-Phrase Chunker (SVM- and ME-based)
Demo: http://www.cs.bgu.ac.il/~nlpproj/demo
10
Hebrew Unknowns
Unknown tokens - tokens not recognized by the lexicon:
- NN: פרזנטור (presenter)
- VB: התחרדן (got warm in the sun)
Unknown analyses - the set of analyses suggested by the lexicon does not contain the correct analysis for the token:
- PN: שמעון פרס (Shimon Peres; the lexicon instead segments it as "that a dorm cut…")
- RB: לעומתי (oppositional; the lexicon knows only "compared with me")
11
Hebrew Unknowns - Evidence
Unknown tokens (4% of the tokens):
- Only 50% of the unknown tokens are PN: selecting PN as a default POS is not sufficient
- More than 30% of the unknown tokens are neologisms: neologism detection is needed
Unknown analyses (3.5%):
- 60% of the unknown analyses are proper names
- Other POS cover 15% of the unknowns (only 1.1% of the tokens)
- A PN classifier is sufficient for unknown analyses
12
Hebrew Unknown Tokens Analysis
Objective: given an unknown token, extract all possible morphological analyses and assign a likelihood to each analysis.
Example: התחרדן (got warm in the sun)
- verb.singular.masculine.third.past  0.6
- proper noun                         0.2
- noun.def.singular.masculine         0.1
- noun.singular.masculine.absolute    0.05
- noun.singular.masculine.construct   0.001
- …
13
Previous Work - English
- Heuristics: tag-specific heuristics over spelling features (capitalization, hyphens, suffixes) [Weischedel et al. 95]
- Guessing rules learned from raw text [Mikheev 97]
- HMM with tag-suffix transitions [Thede & Harper 99]
14
Previous Work - Arabic
- Root-pattern features for morphological analysis and generation of Arabic dialects [Habash & Rambow 06]
- Combination of a lexicon-based and a character-based tagger [Mansour et al. 07]
15
Our Approach
Resources:
- A large amount of unlabeled data (unsupervised)
- A comprehensive lexicon (lexicon-based)
Hypothesis: the characteristics of unknown tokens are similar to those of known tokens.
Method: tag-distribution models, based on the morphological analyses of the known tokens in the corpus:
- Letters model
- Pattern model
- Linear-context model
16
Notation
- Token: a sequence of characters bounded by spaces, e.g., בצל bcl
- Prefixes: the prefixes according to each analysis; e.g., under the preposition+noun analysis (under a shadow): ב b
- Base-word: the token without its prefixes, per analysis; for the noun analysis (an onion): בצל bcl; for the preposition+noun analysis (under a shadow): צל cl
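To make the token/prefix/base-word split concrete, here is a minimal sketch of prefix segmentation in Python. The single-letter prefix inventory and the single-prefix limit are our simplifications for illustration; the actual analyzer licenses prefix sequences through the lexicon.

```python
# Minimal sketch: enumerate (prefix, base-word) splits of a token.
# PREFIXES is a hypothetical, simplified inventory (transliterated);
# the real analyzer validates prefix sequences against the lexicon.
PREFIXES = {"b", "l", "k", "m", "w", "h", "sh"}

def segmentations(token):
    """Yield every (prefix, base_word) split licensed by PREFIXES."""
    yield "", token  # the analysis with no prefix
    for i in range(1, len(token)):
        prefix, base = token[:i], token[i:]
        if prefix in PREFIXES:
            yield prefix, base

print(list(segmentations("bcl")))  # [('', 'bcl'), ('b', 'cl')]
```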
17
Letters Model
For each possible analysis of a given token:
Features:
- Positioned uni-, bi- and trigram letters of the base-word
- The prefixes of the base-word
- The length of the base-word
Value: a full set of the morphological properties (as given by the lexicon)
Training: raw-text corpus + lexicon → ME → Letters Model
18
Letters Model - An example
Known token: בצל bcl
Analysis 1 - an onion (base-word: bcl):
- Grams: b:1 c:2 l:3 b:-3 c:-2 l:-1 bc:1 cl:2 bc:-2 cl:-1 bcl:1 bcl:-1
- Prefix: none
- Length of base-word: 3
- Value: noun.singular.masculine.absolute
Analysis 2 - under a shadow (prefix: b, base-word: cl):
- Grams: c:1 l:2 c:-2 l:-1 cl:1 cl:-1
- Prefix: b
- Length of base-word: 2
- Value: preposition+noun.singular.masculine.absolute
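A minimal sketch of this feature extraction, reproducing the positioned n-gram indices of the bcl example above; the feature-name format is our own assumption, not necessarily the paper's.

```python
# Minimal sketch of the letters-model features for one analysis.
# N-gram positions are counted from the start (1-based) and from the
# end (negative), matching the bcl example above.
def letter_features(base_word, prefix):
    feats = {}
    n = len(base_word)
    for size in (1, 2, 3):                            # uni-, bi-, trigrams
        for i in range(n - size + 1):
            gram = base_word[i:i + size]
            feats[f"{gram}:{i + 1}"] = 1              # offset from the start
            feats[f"{gram}:{i + size - n - 1}"] = 1   # offset from the end
    feats[f"prefix:{prefix or 'none'}"] = 1
    feats[f"len:{n}"] = 1
    return feats

print(sorted(letter_features("cl", "b")))
# ['c:-2', 'c:1', 'cl:-1', 'cl:1', 'l:-1', 'l:2', 'len:2', 'prefix:b']
```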
19
Pattern Model
Word formation in Hebrew is based on root+template and/or affixation. Based on [Nir 93], we defined 40 common neologism formation patterns, e.g.:
- Verb template: miCCeC מחזר, tiCCeC תארך
- Noun suffixation: -ut שיפוטיות, -iya בידוריה
- Noun template: tiCCoCet תפרוסת, miCCaCa מגננה
- Adjective suffixation: -ali סטודנטיאלי, -oni טלויזיוני
- Adverb suffixation: -it לעומתית
20
Pattern Model
For each possible analysis of a given token:
Features:
- For each pattern: 1 if the token fits the pattern, 0 otherwise
- A 'no pattern' feature
Value: a full set of the morphological properties (as given by the lexicon)
Training: raw-text corpus + lexicon + patterns → ME → Pattern Model
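A minimal sketch of the binary pattern features, with two toy regexes standing in for two of the 40 hand-crafted patterns; the transliterated regexes are our simplification, since the real patterns are defined over Hebrew orthography.

```python
# Minimal sketch of pattern features: one binary feature per pattern,
# plus a 'no pattern' feature that fires when nothing matches.
import re

PATTERNS = {
    "verb_miCCeC": re.compile(r"^m[^aeiou]{2}e?[^aeiou]$"),  # toy miCCeC template
    "noun_ut_suffix": re.compile(r"ut$"),                    # -ut suffixation
}

def pattern_features(token):
    feats = {name: int(bool(p.search(token)))
             for name, p in PATTERNS.items()}
    feats["no_pattern"] = int(not any(feats.values()))
    return feats

print(pattern_features("shiputiut"))
# {'verb_miCCeC': 0, 'noun_ut_suffix': 1, 'no_pattern': 0}
```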
21
Letters + Pattern Model
For each possible analysis of a given token:
Features: the letters features together with the pattern features
Value: a full set of the morphological properties (as given by the lexicon)
Training: raw-text corpus + lexicon + patterns → ME → Letters + Pattern Model
22
Linear-context Model
"The draje of the tagement starts rikking with Befa."
P(t|w) is hard to estimate for unknown tokens: P(noun|draje), P(adjective|draje), P(verb|draje).
Alternatively, P(t|c) can be learned for known contexts: P(noun|The, of), P(adjective|The, of), P(verb|The, of).
Observed context information:
- Lexical distribution, as relative frequencies over all the words in the corpus: word given context P(w|c), e.g., P(draje|The, of); context given word P(c|w), e.g., P(The, of|draje)
- Morpho-lexical distribution of known tokens: P(t|w_i), e.g., P(determiner|The), P(preposition|of)
Similar-words algorithm: [Levinger et al. 95], [Adler 07], [Goldberg et al. 08]
23
Linear-context Model - Estimation
Notation: w = known word, c = context of a known word, t = tag
Initial conditions:
- From the raw-text corpus: p(w|c), p(c|w)
- From the lexicon: p(t|w)
Update steps (expectation-maximization style):
p(t|c) = Σ_w p(t|w) p(w|c)
p(t|w) = Σ_c p(t|c) p(c|w)
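A minimal sketch of these alternating updates; the nested-dict layout and the fixed number of rounds are our assumptions, and context extraction, smoothing, and convergence criteria are left out.

```python
# Minimal sketch of the updates:
#   p(t|c) = sum_w p(t|w) * p(w|c)
#   p(t|w) = sum_c p(t|c) * p(c|w)
# Distributions are nested dicts, e.g. p_w_given_c[c][w] = p(w|c).
def iterate(p_t_given_w, p_w_given_c, p_c_given_w, rounds=5):
    for _ in range(rounds):
        # Tag distribution per context, summed over the words in it.
        p_t_given_c = {}
        for c, words in p_w_given_c.items():
            dist = {}
            for w, p_wc in words.items():
                for t, p_tw in p_t_given_w.get(w, {}).items():
                    dist[t] = dist.get(t, 0.0) + p_tw * p_wc
            p_t_given_c[c] = dist
        # Tag distribution per word, summed over its contexts.
        for w, ctxs in p_c_given_w.items():
            dist = {}
            for c, p_cw in ctxs.items():
                for t, p_tc in p_t_given_c.get(c, {}).items():
                    dist[t] = dist.get(t, 0.0) + p_tc * p_cw
            p_t_given_w[w] = dist
    return p_t_given_w
```

The second update is what makes unknown words tractable: a word's tag distribution is assembled from the tag distributions of the contexts it occurs in.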
24
Evaluation
Resources:
- Lexicon: MILA
- Training: unlabeled corpus of 42M tokens
- Test: annotated news articles, 90K token instances (3% unknown tokens, 2% unknown analyses)
- PN classifier
25
Evaluation - Models
- Baseline: most frequent tag (proper name) for all possible segmentations of the token
- Letters model
- Pattern model
- Letters + Pattern model
- Linear-context, Letters
- Linear-context, Pattern
- Linear-context, Letters + Pattern
26
Evaluation - Criteria
Suggested analysis set:
- Coverage of the correct analysis
- Ambiguity level (average number of analyses)
- Average probability of the correct analysis
Disambiguation accuracy:
- Number of correct analyses picked by the complete system
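A minimal sketch of how the three analysis-set criteria could be computed, assuming each test token comes with the model's scored analysis set and a gold analysis; the data layout is our assumption.

```python
# Minimal sketch of the analysis-set criteria. Each item pairs the
# model's scored analysis set (analysis -> probability) with the gold
# analysis for that token.
def analysis_set_metrics(items):
    n = len(items)
    coverage = sum(gold in analyses for analyses, gold in items) / n
    ambiguity = sum(len(analyses) for analyses, _ in items) / n
    avg_gold_prob = sum(analyses.get(gold, 0.0) for analyses, gold in items) / n
    return {"coverage": 100 * coverage,
            "ambiguity": ambiguity,
            "probability": avg_gold_prob}

items = [({"noun": 0.7, "verb": 0.3}, "noun"),
         ({"verb": 1.0}, "adjective")]
print(analysis_set_metrics(items))
# {'coverage': 50.0, 'ambiguity': 1.5, 'probability': 0.35}
```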
27
Evaluation - Full Morphological Analysis

Model                              Coverage  Ambiguity  Probability  Accuracy
Baseline                               50.8       1.5          0.48      57.3
Letters                                76.7       5.9          0.32      69.1
Pattern                                82.8      20.4          0.10      66.8
Letters + Pattern                      84.1      10.4          0.25      69.8
Linear-context, Letters                80.7       7.94         0.30      69.7
Linear-context, Pattern                84.4      21.7          0.12      66.5
Linear-context, Letters + Pattern      85.2      12.0          0.24      68.8

(Coverage, ambiguity and probability describe the suggested analysis set; accuracy is full morphological disambiguation.)
28
Evaluation - Word Segmentation and POS Tagging

Model                              Coverage  Ambiguity  Probability  Accuracy
Baseline                               52.9       1.5          0.52      60.6
Letters                                80.0       4.0          0.39      77.6
Pattern                                87.4       8.7          0.19      76.0
Letters + Pattern                      86.7       6.2          0.32      78.5
Linear-context, Letters                83.8       4.5          0.37      78.2
Linear-context, Pattern                88.7       8.8          0.21      75.8
Linear-context, Letters + Pattern      87.8       6.5          0.32      77.5

(Accuracy here is word segmentation plus POS tagging.)
29
Evaluation - Conclusion
- Error reduction of more than 30% over a competitive baseline, on a large-scale dataset of 90K tokens:
  - Word segmentation and POS tagging: 79% accuracy
  - Full morphological disambiguation: 70% accuracy
- The unsupervised linear-context model is as effective as a model that uses hand-crafted patterns
- Effective combination of textual observations from unlabeled data with a lexicon
- Effective combination of an ME model for tag distribution with an SVM model for PN classification
- Overall, error reduction of 5% for the whole disambiguation system
30
Summary
- The characteristics of known words can help resolve unknown words
- Unsupervised (unlabeled data), lexicon-based approach
- Language-independent algorithm for computing the distribution p(t|w) for unknown words
- The nature of agglutinated prefixes in Hebrew [Ben-Eliahu et al. 2008]
31
Example: תנקס tnqs (thanks)
- foreign                           0.4
- proper noun                       0.3
- noun.plural.feminine.absolute     0.2
- verb.singular.feminine.3.future   0.08
- verb.singular.masculine.2.future  0.02