A memory-based learning-plus-inference approach to morphological analysis
Antal van den Bosch
With Walter Daelemans, Ton Weijters, Erwin Marsi, Abdelhadi Soudi, and Sander Canisius
ILK / Language and Information Sciences Dept., Tilburg University, The Netherlands
FLaVoR Workshop, 17 November 2006, Leuven

Learning plus inference
Paradigmatic solution to natural language processing tasks
Decomposition:
  – The disambiguation of local, elemental ambiguities in context
  – A holistic, global coordination of local decisions over the entire sequence

Learning plus inference
Example: grapheme-phoneme conversion
Local decisions
  – The mapping of a vowel letter in context to a vowel phoneme with primary stress
Global coordination
  – Making sure that there is only one primary stress

Learning plus inference
Example: dependency parsing
Local decisions
  – The relation between a noun and a verb is of the “subject” type
Global coordination
  – The verb has only one subject relation

Learning plus inference
Example: named entity recognition
Local decisions
  – A name that can be a location or a person is a location in this context
Global coordination
  – Everywhere in the text, this name refers to the location

Learning plus inference
Local decision making by learning
  – All NLP decisions can be recast as classification tasks (Daelemans, 1996: segmentation or identification)
Global coordination by inference
  – Given local proposals that may conflict, find the best overall solution (e.g. minimizing conflict, or adhering to a language model)
Collins and colleagues; Manning and Klein and colleagues; Dan Roth and colleagues; Marquez and Carreras; etc.

L+I and morphology
Segmentation boundaries, spelling changes, and PoS tagging recast as classification
Global inference checks for
  – Noun stem followed by noun inflection
  – Infix in a noun-noun compound is surrounded by two nouns
  – Etc.

Talk overview
English morphological segmentation
  – Easy learning
  – Inference not really needed
Dutch morphological analysis
  – Learning operations rather than simple decisions
  – Reasonably complex inference
Arabic morphological analysis
  – Learning as an attempt at lowering the massive ambiguity
  – Inference as an attempt to separate the chaff from the grain

English segmentation
(Van den Bosch, Daelemans, Weijters, NeMLaP 1996)
Morphological segmentation as classification
Versus traditional approach:
  – E.g. MITalk’s DECOMP, analysing "scarcity":
      First analysis: scar|city - both stems found in the morpheme lexicon, and validated as a possible analysis
      Second analysis: scarc|ity - stem "scarce" found by applying the e-deletion rule; suffix -ity found; validated as a possible analysis
      Cost-based heuristic prefers stem|derivation over stem|stem
      Ingredients: morpheme lexicons, finite-state analysis validator, spelling-change rules, cost heuristics
  – Validator, rules, and cost heuristics are costly knowledge-based resources

English segmentation
Segmentations as local decisions
  – To segment or not to segment
  – If segmenting, identify the start (or end) of
      a stem
      an affix
      an inflectional morpheme

English segmentation
Three tasks: given a letter in context, is it the start of
  – a segment or not
  – a derivational morpheme (stem or affix), inflection, or not
  – a stem, a stress-affecting affix, a stress-neutral affix, an inflection, or not

English segmentation

Local classification
Memory-based learning
  – k-nearest neighbor classification
  – (Daelemans & Van den Bosch, 2005)
E.g. instance #9
  – m a l i t i e ?
Nearest neighbors: a lot of evidence for “2”:

Instance        class  distance  clones
m a l i t i e   2      0         2x
t a l i t i e   2      1         3x
u a l i t i e   2      1         2x
i a l i t i e   2      1         11x
g a l i t i e   2      1         2x
n a l i t i e   2      1         7x
r a l i t i e   2      1         5x
c a l i t i e   2      1         7x
p a l i t i e   2      1         2x
h a l i t i c   s      2         1x
…

Memory-based learning
Similarity function:
  – X and Y are instances
  – n is the number of features
  – x_i is the value of the i-th feature of X
  – w_i is the weight of the i-th feature
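The formula for the similarity function did not survive in this transcript. As a minimal sketch, assume the weighted overlap metric commonly used in memory-based learning: the distance between X and Y is the sum of w_i over all features i for which x_i and y_i differ. The function names, the toy memory, the uniform weights, and the reading of k as "the k nearest distances" are all assumptions made for illustration, not details taken from the slides.

```python
from collections import Counter

def overlap_distance(x, y, weights):
    """Weighted overlap: add up the weights of the features whose values differ."""
    return sum(w for xi, yi, w in zip(x, y, weights) if xi != yi)

def knn_classify(memory, query, weights, k=1):
    """Majority vote among all stored instances at the k smallest distinct distances.

    Treating k as a number of distances (not of neighbours) is an assumption.
    """
    scored = [(overlap_distance(inst, query, weights), label) for inst, label in memory]
    nearest = sorted({d for d, _ in scored})[:k]
    votes = Counter(label for d, label in scored if d in nearest)
    return votes.most_common(1)[0][0]

# Toy memory loosely modelled on the "m a l i t i e" example (illustrative only).
memory = [
    (tuple("malitie"), "2"),
    (tuple("talitie"), "2"),
    (tuple("ualitie"), "2"),
    (tuple("halitic"), "s"),
]
weights = [1.0] * 7  # uniform feature weights: a simplifying assumption

print(knn_classify(memory, tuple("malitie"), weights))  # stored instance, distance 0
print(knn_classify(memory, tuple("xalitie"), weights))  # unseen instance
```

A stored instance is always retrieved at distance 0, which is why the same memory can act as a lexicon for training words and as a classifier for unseen ones.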

Similarity function components

Generalizing lexicon
A memory-based morphological analyzer is
  – A lexicon: 100% accurate reconstruction of all examples in the training material
  – At the same time, capable of processing unseen words
In essence, unseen words are the only problem remaining
  – CELEX Dutch has 300k+ words; average coverage of text is 90%-95%
  – Evaluation should focus solely on unseen words
  – So, a held-out test set from CELEX is fairly representative of unseen words

Experiments
CELEX English
  – 65,558 segmented words
  – 573,544 instances
10-fold cross-validation
  – Measuring accuracy
  – M1: 88.0% correct test words
  – M2: 85.6% correct test words
  – M3: 82.4% correct test words

Add inference
(Van den Bosch and Canisius, SIGPHON 2006)
Original approach: only learning
Now: inference
  – Constraint satisfaction inference
  – Based on Van den Bosch and Daelemans (CoNLL 2005) trigram prediction

Constraint satisfaction inference
Predict trigrams, and use them as completely as possible
Formulate the inference procedure as a constraint satisfaction problem
Constraint satisfaction
  – Assigning values to a number of variables while satisfying certain predefined constraints
Constraint satisfaction for inference
  – Each token maps to a variable, whose domain consists of the three candidate labels proposed for it
  – Constraints are derived from the predicted trigrams

Constraint satisfaction inference
Example: input h a n d

input  output (predicted trigram)
h      _ h {
a      h { n
n      { n t
d      n d _

Trigram constraints
  h,a,n → h,{,n
  a,n,d → {,n,t
Bigram constraints
  h,a → h,{   (twice)
  a,n → {,n   (twice)
  n,d → n,t
  n,d → n,d
Unigram constraints
  h → h   (twice)
  a → {   (three times)
  n → n   (three times)
  d → t
  d → d
Conflicting constraints: n,d → n,t versus n,d → n,d, and d → t versus d → d

Weighted constraint satisfaction
Extension of constraint satisfaction to deal with overconstrainedness
  – Each constraint has a weight associated with it
  – The optimal solution assigns those values to the variables that maximise the sum of the weights of the satisfied constraints
For constraint satisfaction inference, a constraint's weight should reflect the classifier's confidence in its correctness
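Weighted constraint satisfaction can be sketched on the "hand" example. Each predicted trigram is broken into the trigram, bigram, and unigram constraints that fall inside the word, each weighted by the confidence of the classification it came from, and the label sequence with the highest summed weight of satisfied constraints wins. The confidence values and function names are hypothetical, and the exhaustive search is a simplification chosen for brevity; the slides do not say which search procedure was used.

```python
from itertools import product

def constraints_from_trigrams(predictions):
    """predictions[i] = (trigram, confidence) for token i, covering positions i-1, i, i+1.

    Returns (start, labels, weight) constraints for every contiguous sub-n-gram
    that lies fully inside the sequence.
    """
    n = len(predictions)
    constraints = []
    for i, (trigram, conf) in enumerate(predictions):
        cells = [(pos, lab) for pos, lab in zip((i - 1, i, i + 1), trigram)
                 if 0 <= pos < n]
        for size in range(1, len(cells) + 1):
            for s in range(len(cells) - size + 1):
                chunk = cells[s:s + size]
                labels = tuple(lab for _, lab in chunk)
                constraints.append((chunk[0][0], labels, conf))
    return constraints

def weighted_csi(n_tokens, constraints):
    """Brute-force search for the assignment maximising the satisfied weight."""
    # Each token's domain: the labels proposed for it by any unigram constraint.
    domains = [sorted({labels[0] for start, labels, _ in constraints
                       if len(labels) == 1 and start == pos})
               for pos in range(n_tokens)]
    best, best_score = None, float("-inf")
    for candidate in product(*domains):
        score = sum(w for start, labels, w in constraints
                    if candidate[start:start + len(labels)] == labels)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Predicted trigrams for h, a, n, d; the confidences are made up for illustration.
predictions = [
    (("_", "h", "{"), 0.9),
    (("h", "{", "n"), 0.9),
    (("{", "n", "t"), 0.8),
    (("n", "d", "_"), 0.6),
]
constraints = constraints_from_trigrams(predictions)
solution, score = weighted_csi(len(predictions), constraints)
print(" ".join(solution))  # -> h { n t
```

With these made-up confidences, the constraints supporting t outweigh those supporting d, so the conflict on the final letter is resolved in favour of t.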

Example instances

Left        focus  right       uni  tri
_ _ _ _ _   a      b n o r m   2    -20
_ _ _ _ a   b      n o r m a   0    20s
_ _ _ a b   n      o r m a l   s    0s0
_ _ a b n   o      r m a l i   0    s00
_ a b n o   r      m a l i t   0    000
a b n o r   m      a l i t i   0    000
b n o r m   a      l i t i e   0    000
n o r m a   l      i t i e s   0    001
o r m a l   i      t i e s _   1    010
r m a l i   t      i e s _ _   0    100
m a l i t   e      s _ _ _ _   0    000
a l i t i   e      s _ _ _ _   0    00i
l i t i e   s      _ _ _ _ _   i    0i-
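The instance table above can be generated mechanically from a word and its per-letter class labels. The following is a minimal sketch; the function name and argument layout are assumptions, while the window width of five letters on each side, the "_" padding, and the "-" edge marker follow the table.

```python
def make_instances(word, labels, width=5, pad="_"):
    """Turn a word and its per-letter labels into windowed instances.

    Returns one (left, focus, right, unigram, trigram) tuple per letter. The
    trigram joins the previous, own, and next label; "-" marks a missing
    neighbour at the word edge.
    """
    assert len(word) == len(labels)
    padded = pad * width + word + pad * width
    edge = ["-"] + list(labels) + ["-"]
    rows = []
    for i, focus in enumerate(word):
        left = padded[i:i + width]
        right = padded[i + width + 1:i + 2 * width + 1]
        trigram = "".join(edge[i:i + 3])
        rows.append((left, focus, right, labels[i], trigram))
    return rows

# Labels copied from the "uni" column of the table, one per letter.
rows = make_instances("abnormalities", "20s000001000i")
for left, focus, right, uni, tri in rows:
    print(" ".join(left), focus, " ".join(right), uni, tri)
```

Because each trigram class overlaps with those of its neighbours, every letter receives three label proposals, which is what the inference step later reconciles.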

Results
Only learning:
  – M3: 82.4% correct unseen words
Learning + CSI:
  – M3: 85.4% correct unseen words
Mild effect.

Dutch morphological analysis
(Van den Bosch & Daelemans, 1999; Van den Bosch & Canisius, 2006)
Task expanded to
  – Spelling changes
  – Part-of-speech tagging
  – Analysis generation
Dutch is mildly productive
  – Compounding
  – A bit more inflection than in English
  – Infixes, diminutives, …

Dutch morphological analysis

Left        focus  right       uni     tri
_ _ _ _ _   a      b n o r m   A       - A 0
_ _ _ _ a   b      n o r m a   0       A 0 0
_ _ _ a b   n      o r m a l   0       0 0 0
_ _ a b n   o      r m a l i   0       0 0 0
_ a b n o   r      m a l i t   0       0 0 0
a b n o r   m      a l i t e   0       0 0 0
b n o r m   a      l i t e i   0       0 0 +Da
n o r m a   l      i t e i t   +Da     0 +Da A_->N
o r m a l   i      t e i t e   A_->N   +Da A_->N 0
r m a l i   t      e i t e n   0       A_->N 0 0
m a l i t   e      i t e n _   0       0 0 0
a l i t e   i      t e n _ _   0       0 0 0
l i t e i   t      e n _ _ _   0       0 0 plural
i t e i t   e      n _ _ _ _   plural  0 plural 0
t e i t e   n      _ _ _ _ _   0       plural 0 -

Spelling changes
Deletion, insertion, replacement

b n o r m a l i t e i   0
n o r m a l i t e i t   +Da
o r m a l i t e i t e   A_->N

abnormaliteiten analyzed as [[abnormaal] A iteit] N [en] plural
Root form has a double a; the wordform drops one a

Part-of-speech
Selection processes in derivation

n o r m a l i t e i t   +Da
o r m a l i t e i t e   A_->N
r m a l i t e i t e n   0

Stem abnormaal is an adjective; affix -iteit seeks an adjective to its left to turn it into a noun

Experiments
CELEX Dutch:
  – 336,698 words
  – 3,209,090 instances
10-fold cross-validation
Learning only: 41.3% correct unseen words
With CSI: 51.9% correct unseen words
Useful improvement

Arabic analysis (Marsi, Van den Bosch, and Soudi, 2005)

Arabic analysis

Problem of undergeneration and overgeneration of analyses
Undergeneration: at k=1,
  – 7 out of 10 generated analyses of unknown words are correct, but
  – 4 out of 5 of the real analyses are not generated
Overgeneration: at k=10,
  – Only 3 out of 5 of the real analyses are missed, but
  – Half of the generated analyses are incorrect
Harmony at k=3 (F-score 0.42)

Discussion (1)
Memory-based morphological analysis
  – Lexicon and analyzer in one
  – Extremely simple algorithm
Unseen words are the remaining problem
Learning: local classifications
  – From simple boundary decisions
  – To complex operations
  – And trigrams
Inference:
  – More complex morphologies need more inference effort

Discussion (2)
Ceiling not reached yet; good solutions still wanted
  – Particularly for unknown words with unknown stems
  – Also, recent work by De Pauw!
External evaluation needed
  – Integration with part-of-speech tagging (software packages forthcoming)
  – Effect on IR, IE, QA
  – Effect in ASR

Thank you. http://ilk.uvt.nl Antal.vdnBosch@uvt.nl
