1
Unsupervised Lexicon-Based Resolution of Unknown Words for Full Morphological Analysis
Meni Adler, Yoav Goldberg, David Gabay, Michael Elhadad
Ben-Gurion University
ACL 2008, Columbus, Ohio
2
Unknown Words - English
"The draje of the tagement starts rikking with Befa."
Morphology: unknown word → analysis 1 (prob 1), …, analysis i (prob i), …, analysis n (prob n) → Syntax
3
Unknown Words - English
"The draje of the tagement starts rikking with Befa."
Morphology handles the unknown words: tagement, rikking, Befa
Syntax handles the full sentence: The draje of the tagement starts rikking with Befa.
4
Unknowns Resolution Method in English
Baseline method:
- PN tag for capitalized words
- Uniform distribution over the open-class tags otherwise
Evaluation:
- 12 open-class tags
- 45% of the unknowns are capitalized
- Overall, 70% of the unknowns were tagged correctly
5
Unknowns Resolution Method in Hebrew
The baseline method resolves only 5% of the Hebrew unknown tokens!
Why? How can we improve?
6
Unknown Words - Hebrew
"The draje of the tagement starts rikking with Befa."
עם בפה דרג הטגמנט התחיל לנפן
drj htgmnt hthil lnpn `m bfh
7
Unknown Words - Hebrew
drj htgmnt hthil lnpn `m bfh
Morphology:
- No capitalization: PN is always a candidate
- Many open-class tags (> 3,000)
Syntax:
- Unmarked, agglutinated function words: preposition, definite article, conjunction, relative marker; e.g., "the drj" and "of drj" are each written as a single token
- Function-word ambiguity:
  htgmnt: VBP/VBI, DEF+MM
  `m: PREP (with), NN (people)
  bfh: PREP+NN/PN/JJ/VB, PREP+DEF+NN/JJ…
8
Outline
- Characteristics of Hebrew Unknowns
- Previous Work
- Unsupervised Lexicon-based Approaches
  - Letters Model
  - Pattern Model
  - Linear-context Model
- Evaluation
- Conclusion
9
Hebrew Text Analysis System
Pipeline: text → Tokenizer → tokenized text → Morphological Analyzer (lexicon; the Unknown Tokens Analyzer of this work supplies analysis distributions for out-of-lexicon words) → analysis distributions for known and unknown words → Morphological Disambiguator (HMM) → disambiguated text → Proper-name Classifier (SVM) → disambiguated text with PN
Further components: Named-Entity Recognizer, Noun-Phrase Chunker (SVM- and ME-based)
Demo: http://www.cs.bgu.ac.il/~nlpproj/demo
10
Hebrew Unknowns
Unknown tokens - tokens not recognized by the lexicon:
- NN: פרזנטור (presenter)
- VB: התחרדן (got warm in the sun)
Unknown analyses - the set of analyses suggested by the lexicon does not contain the correct analysis for the token:
- PN: שמעון פרס (Shimon Peres; the lexicon instead segments it as "that a dorm cut…")
- RB: לעומתי (oppositional; the lexicon knows only "compared with me")
11
Hebrew Unknowns - Evidence
Unknown tokens (4% of the tokens):
- Only 50% of the unknown tokens are PN: selecting PN as a default POS is not sufficient
- More than 30% of the unknown tokens are neologisms: neologism detection is needed
Unknown analyses (3.5%):
- 60% of the unknown analyses are proper names
- Other POS cover 15% of the unknowns (only 1.1% of the tokens)
- A PN classifier is sufficient for unknown analyses
12
Hebrew Unknown Tokens Analysis
Objective: given an unknown token, extract all possible morphological analyses and assign a likelihood to each analysis.
Example: התחרדן (got warm in the sun)
- verb.singular.masculine.third.past  0.6
- proper noun                         0.2
- noun.def.singular.masculine         0.1
- noun.singular.masculine.absolute    0.05
- noun.singular.masculine.construct   0.001
- …
13
Previous Work - English
- Heuristics: tag-specific heuristics over spelling features (capitalization, hyphens, suffixes) [Weischedel et al. 95]
- Guessing rules learned from raw text [Mikheev 97]
- HMM with tag-suffix transitions [Thede & Harper 99]
14
Previous Work - Arabic
- Root-pattern features for morphological analysis and generation of Arabic dialects [Habash & Rambow 06]
- Combination of a lexicon-based and a character-based tagger [Mansour et al. 07]
15
Our Approach
Resources:
- A large amount of unlabeled data (unsupervised)
- A comprehensive lexicon (lexicon-based)
Hypothesis: the characteristics of unknown tokens are similar to those of known tokens.
Method: tag-distribution models, based on the morphological analyses of the known tokens in the corpus:
- Letters model
- Pattern model
- Linear-context model
16
Notation
- Token: a sequence of characters bounded by spaces, e.g., בצל bcl
- Prefixes: the prefixes according to each analysis; e.g., under the preposition+noun analysis (under a shadow): ב b
- Base-word: the token without its prefixes, per analysis; for the noun analysis (an onion): בצל bcl; for the preposition+noun analysis (under a shadow): צל cl
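To make the token/prefix/base-word split concrete, here is a minimal sketch of prefix segmentation in Python. The single-letter prefix inventory and the single-prefix limit are our simplifications for illustration; the actual analyzer licenses prefix sequences through the lexicon.

```python
# Minimal sketch: enumerate (prefix, base-word) splits of a token.
# PREFIXES is a hypothetical, simplified inventory (transliterated);
# the real analyzer validates prefix sequences against the lexicon.
PREFIXES = {"b", "l", "k", "m", "w", "h", "sh"}

def segmentations(token):
    """Yield every (prefix, base_word) split licensed by PREFIXES."""
    yield "", token  # the analysis with no prefix
    for i in range(1, len(token)):
        prefix, base = token[:i], token[i:]
        if prefix in PREFIXES:
            yield prefix, base

print(list(segmentations("bcl")))  # [('', 'bcl'), ('b', 'cl')]
```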
17
Letters Model
For each possible analysis of a given token:
Features:
- Positioned uni-, bi- and trigram letters of the base-word
- The prefixes of the base-word
- The length of the base-word
Value: a full set of the morphological properties (as given by the lexicon)
Training: raw-text corpus + lexicon → ME → Letters Model
18
Letters Model - An example
Known token: בצל bcl
Analysis 1 - an onion (base-word: bcl):
- Grams: b:1 c:2 l:3 b:-3 c:-2 l:-1 bc:1 cl:2 bc:-2 cl:-1 bcl:1 bcl:-1
- Prefix: none
- Length of base-word: 3
- Value: noun.singular.masculine.absolute
Analysis 2 - under a shadow (prefix: b, base-word: cl):
- Grams: c:1 l:2 c:-2 l:-1 cl:1 cl:-1
- Prefix: b
- Length of base-word: 2
- Value: preposition+noun.singular.masculine.absolute
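A minimal sketch of this feature extraction, reproducing the positioned n-gram indices of the bcl example above; the feature-name format is our own assumption, not necessarily the paper's.

```python
# Minimal sketch of the letters-model features for one analysis.
# N-gram positions are counted from the start (1-based) and from the
# end (negative), matching the bcl example above.
def letter_features(base_word, prefix):
    feats = {}
    n = len(base_word)
    for size in (1, 2, 3):                            # uni-, bi-, trigrams
        for i in range(n - size + 1):
            gram = base_word[i:i + size]
            feats[f"{gram}:{i + 1}"] = 1              # offset from the start
            feats[f"{gram}:{i + size - n - 1}"] = 1   # offset from the end
    feats[f"prefix:{prefix or 'none'}"] = 1
    feats[f"len:{n}"] = 1
    return feats

print(sorted(letter_features("cl", "b")))
# ['c:-2', 'c:1', 'cl:-1', 'cl:1', 'l:-1', 'l:2', 'len:2', 'prefix:b']
```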
19
Pattern Model
Word formation in Hebrew is based on root+template and/or affixation. Based on [Nir 93], we defined 40 common neologism formation patterns, e.g.:
- Verb template: miCCeC מחזר, tiCCeC תארך
- Noun suffixation: -ut שיפוטיות, -iya בידוריה
- Noun template: tiCCoCet תפרוסת, miCCaCa מגננה
- Adjective suffixation: -ali סטודנטיאלי, -oni טלויזיוני
- Adverb suffixation: -it לעומתית
20
Pattern Model
For each possible analysis of a given token:
Features:
- For each pattern: 1 if the token fits the pattern, 0 otherwise
- A 'no pattern' feature
Value: a full set of the morphological properties (as given by the lexicon)
Training: raw-text corpus + lexicon + patterns → ME → Pattern Model
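A minimal sketch of the binary pattern features, with two toy regexes standing in for two of the 40 hand-crafted patterns; the transliterated regexes are our simplification, since the real patterns are defined over Hebrew orthography.

```python
# Minimal sketch of pattern features: one binary feature per pattern,
# plus a 'no pattern' feature that fires when nothing matches.
import re

PATTERNS = {
    "verb_miCCeC": re.compile(r"^m[^aeiou]{2}e?[^aeiou]$"),  # toy miCCeC template
    "noun_ut_suffix": re.compile(r"ut$"),                    # -ut suffixation
}

def pattern_features(token):
    feats = {name: int(bool(p.search(token)))
             for name, p in PATTERNS.items()}
    feats["no_pattern"] = int(not any(feats.values()))
    return feats

print(pattern_features("shiputiut"))
# {'verb_miCCeC': 0, 'noun_ut_suffix': 1, 'no_pattern': 0}
```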
21
Letters + Pattern Model
For each possible analysis of a given token:
Features: the letters features together with the pattern features
Value: a full set of the morphological properties (as given by the lexicon)
Training: raw-text corpus + lexicon + patterns → ME → Letters + Pattern Model
22
Linear-context Model
"The draje of the tagement starts rikking with Befa."
P(t|w) is hard to estimate for unknown tokens: P(noun|draje), P(adjective|draje), P(verb|draje).
Alternatively, P(t|c) can be learned for known contexts: P(noun|The, of), P(adjective|The, of), P(verb|The, of).
Observed context information:
- Lexical distribution, as relative frequencies over all the words in the corpus: word given context P(w|c), e.g., P(draje|The, of); context given word P(c|w), e.g., P(The, of|draje)
- Morpho-lexical distribution of known tokens: P(t|w_i), e.g., P(determiner|The), P(preposition|of)
Similar-words algorithm: [Levinger et al. 95], [Adler 07], [Goldberg et al. 08]
23
Linear-context Model - Estimation
Notation: w = known word, c = context of a known word, t = tag
Initial conditions:
- From the raw-text corpus: p(w|c), p(c|w)
- From the lexicon: p(t|w)
Update steps (expectation-maximization style):
p(t|c) = Σ_w p(t|w) p(w|c)
p(t|w) = Σ_c p(t|c) p(c|w)
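A minimal sketch of these alternating updates; the nested-dict layout and the fixed number of rounds are our assumptions, and context extraction, smoothing, and convergence criteria are left out.

```python
# Minimal sketch of the updates:
#   p(t|c) = sum_w p(t|w) * p(w|c)
#   p(t|w) = sum_c p(t|c) * p(c|w)
# Distributions are nested dicts, e.g. p_w_given_c[c][w] = p(w|c).
def iterate(p_t_given_w, p_w_given_c, p_c_given_w, rounds=5):
    for _ in range(rounds):
        # Tag distribution per context, summed over the words in it.
        p_t_given_c = {}
        for c, words in p_w_given_c.items():
            dist = {}
            for w, p_wc in words.items():
                for t, p_tw in p_t_given_w.get(w, {}).items():
                    dist[t] = dist.get(t, 0.0) + p_tw * p_wc
            p_t_given_c[c] = dist
        # Tag distribution per word, summed over its contexts.
        for w, ctxs in p_c_given_w.items():
            dist = {}
            for c, p_cw in ctxs.items():
                for t, p_tc in p_t_given_c.get(c, {}).items():
                    dist[t] = dist.get(t, 0.0) + p_tc * p_cw
            p_t_given_w[w] = dist
    return p_t_given_w
```

The second update is what makes unknown words tractable: a word's tag distribution is assembled from the tag distributions of the contexts it occurs in.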
24
Evaluation
Resources:
- Lexicon: MILA
- Training: unlabeled corpus of 42M tokens
- Test: annotated news articles, 90K token instances (3% unknown tokens, 2% unknown analyses)
- PN classifier
25
Evaluation - Models
- Baseline: most frequent tag (proper name) for all possible segmentations of the token
- Letters model
- Pattern model
- Letters + Pattern model
- Linear-context, Letters
- Linear-context, Pattern
- Linear-context, Letters + Pattern
26
Evaluation - Criteria
Suggested analysis set:
- Coverage of the correct analysis
- Ambiguity level (average number of analyses)
- Average probability of the correct analysis
Disambiguation accuracy:
- Number of correct analyses picked by the complete system
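A minimal sketch of how the three analysis-set criteria could be computed, assuming each test token comes with the model's scored analysis set and a gold analysis; the data layout is our assumption.

```python
# Minimal sketch of the analysis-set criteria. Each item pairs the
# model's scored analysis set (analysis -> probability) with the gold
# analysis for that token.
def analysis_set_metrics(items):
    n = len(items)
    coverage = sum(gold in analyses for analyses, gold in items) / n
    ambiguity = sum(len(analyses) for analyses, _ in items) / n
    avg_gold_prob = sum(analyses.get(gold, 0.0) for analyses, gold in items) / n
    return {"coverage": 100 * coverage,
            "ambiguity": ambiguity,
            "probability": avg_gold_prob}

items = [({"noun": 0.7, "verb": 0.3}, "noun"),
         ({"verb": 1.0}, "adjective")]
print(analysis_set_metrics(items))
# {'coverage': 50.0, 'ambiguity': 1.5, 'probability': 0.35}
```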
27
Evaluation - Full Morphological Analysis

Model                              Coverage  Ambiguity  Probability  Accuracy
Baseline                               50.8       1.5          0.48      57.3
Letters                                76.7       5.9          0.32      69.1
Pattern                                82.8      20.4          0.10      66.8
Letters + Pattern                      84.1      10.4          0.25      69.8
Linear-context, Letters                80.7       7.94         0.30      69.7
Linear-context, Pattern                84.4      21.7          0.12      66.5
Linear-context, Letters + Pattern      85.2      12.0          0.24      68.8

(Coverage, ambiguity and probability describe the suggested analysis set; accuracy is full morphological disambiguation.)
28
Evaluation - Word Segmentation and POS Tagging

Model                              Coverage  Ambiguity  Probability  Accuracy
Baseline                               52.9       1.5          0.52      60.6
Letters                                80.0       4.0          0.39      77.6
Pattern                                87.4       8.7          0.19      76.0
Letters + Pattern                      86.7       6.2          0.32      78.5
Linear-context, Letters                83.8       4.5          0.37      78.2
Linear-context, Pattern                88.7       8.8          0.21      75.8
Linear-context, Letters + Pattern      87.8       6.5          0.32      77.5

(Accuracy here is word segmentation plus POS tagging.)
29
Evaluation - Conclusion
- Error reduction of more than 30% over a competitive baseline, on a large-scale dataset of 90K tokens:
  - Word segmentation and POS tagging: 79% accuracy
  - Full morphological disambiguation: 70% accuracy
- The unsupervised linear-context model is as effective as a model that uses hand-crafted patterns
- Effective combination of textual observations from unlabeled data with a lexicon
- Effective combination of an ME model for tag distribution with an SVM model for PN classification
- Overall, error reduction of 5% for the whole disambiguation system
30
Summary
- The characteristics of known words can help resolve unknown words
- Unsupervised (unlabeled data), lexicon-based approach
- Language-independent algorithm for computing the distribution p(t|w) for unknown words
- The nature of agglutinated prefixes in Hebrew [Ben-Eliahu et al. 2008]
31
Example: תנקס tnqs (thanks)
- foreign                           0.4
- proper noun                       0.3
- noun.plural.feminine.absolute     0.2
- verb.singular.feminine.3.future   0.08
- verb.singular.masculine.2.future  0.02