HELSINKI UNIVERSITY OF TECHNOLOGY
NEURAL NETWORKS RESEARCH CENTRE

Inducing the Morphological Lexicon of a Natural Language from Unannotated Text

{ Mathias.Creutz, Krista.Lagus }@hut.fi

International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05)
Espoo, 17 June 2005

kahvi + n + juo + ja + lle + kin
nyky + ratkaisu + i + sta + mme
tietä + isi + mme + kö + hän
open + mind + ed + ness
un + believ + able

Challenge for NLP: too many words

E.g., Finnish words often consist of lengthy sequences of morphemes (stems, suffixes and prefixes):
– kahvi + n + juo + ja + lle + kin (coffee + of + drink + -er + for + also)
– nyky + ratkaisu + i + sta + mme (current + solution + -s + from + our)
– tietä + isi + mme + kö + hän (know + would + we + INTERR + indeed)

→ Huge number of different possible word forms
→ Important to know the inner structure of words
→ The number of morphemes per word varies greatly

Goal

Learn representations of
– the smallest individually meaningful units of language (morphemes)
– and their interaction
– in an unsupervised and data-driven manner from raw text
– making as general and language-independent assumptions as possible.

Morfessor

State of the art

Rule-based systems
– accurate, language-dependent, adaptivity issues

Unsupervised word segmentation
– sentences can be of different length
– context-insensitive → poor modeling of syntax:
    undersegmentation of frequent strings (“forthepurposeof”)
    oversegmentation of rare strings (“in + s + an + e”)
    no syntactic / morphotactic constraints (“s + can”)

Morfessor Baseline

State of the art (cont’d)

Morphology learning
– Beyond segmentation: allomorphy (“foot – feet, goose – geese”)
– Detection of semantic similarity (e.g., Yarowsky & Wicentowski) (“sing – sings – singe – singed”)
– Learning of paradigms (e.g., John Goldsmith’s Linguistica)
    stems: believ, hop, liv, mov, us
    suffixes: e, ed, es, ing

Very restricted syntax / morphotactics in terms of number of morphemes per word form!

Morfessor with morpheme categories

Lexicon / Grammar dualism
– Word structure captured by a regular expression:
    word = ( prefix* stem suffix* )+
– Morph sequences (words) are generated by a Hidden Markov model:
    # over simpl ific ation s #
    Transition probs, e.g., P(STM | PRE), P(SUF | SUF)
    Emission probs, e.g., P(’over’ | PRE), P(’s’ | SUF)
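The two ingredients on this slide can be sketched in a few lines of Python. This is only an illustration, not the Morfessor implementation: the function names, the dictionary layout and all probability values are hypothetical. The sketch checks a category sequence against the pattern ( prefix* stem suffix* )+ and multiplies transition and emission probabilities along a tagged morph sequence.

```python
import re

# Category tags as on the slide: PRE = prefix, STM = stem, SUF = suffix.
# word = ( prefix* stem suffix* )+, written over space-terminated tags.
WORD_PATTERN = re.compile(r"(?:(?:PRE )*STM (?:SUF )*)+")


def is_valid_word(tags):
    """Return True if the tag sequence matches ( prefix* stem suffix* )+."""
    return WORD_PATTERN.fullmatch("".join(tag + " " for tag in tags)) is not None


def word_probability(morphs, tags, transition, emission):
    """Probability of a tagged morph sequence under a first-order HMM.

    transition[(a, b)] stands for P(b | a), emission[(c, m)] for P(m | c).
    '#' marks the word boundary before the first and after the last morph.
    """
    probability = 1.0
    previous = "#"
    for morph, tag in zip(morphs, tags):
        probability *= transition[(previous, tag)] * emission[(tag, morph)]
        previous = tag
    return probability * transition[(previous, "#")]


# Hypothetical toy probabilities, chosen only for illustration.
transition = {("#", "PRE"): 0.5, ("PRE", "STM"): 1.0, ("STM", "SUF"): 0.5,
              ("SUF", "SUF"): 0.5, ("SUF", "#"): 0.5}
emission = {("PRE", "over"): 0.5, ("STM", "simpl"): 0.25,
            ("SUF", "ific"): 0.5, ("SUF", "ation"): 0.5, ("SUF", "s"): 0.5}

morphs = ["over", "simpl", "ific", "ation", "s"]
tags = ["PRE", "STM", "SUF", "SUF", "SUF"]
print(is_valid_word(tags), word_probability(morphs, tags, transition, emission))
```

Because the pattern ends in `)+`, a sequence such as STM STM (a compound of two stems) is accepted, while a word that starts with a suffix or ends with a prefix is rejected.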

Lexicon

Each morph in the lexicon has a “Form” and a “Meaning”, described by the columns String, Length, Frequency, Right perplexity and Left perplexity.

(Table: example entries for the morphs over, simpl and s. The numeric cells are fused in the transcript: 1402913614over 41415simpl 17259146181s ...)

How meaning affects morphotactic role

Prior probability distributions for category membership of a morph, e.g., P(PRE | ’over’)

Assume asymmetries between the categories:

How meaning affects role (cont’d)

There is an additional non-morpheme category for cases where none of the proper classes is likely:

Distribute remaining probability mass proportionally, e.g.,
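The formulas on these two slides are not preserved in the transcript. The following Python sketch shows one way the described scheme could work; every modelling choice in it is an assumption made for illustration: a sigmoid maps right perplexity to "prefix-likeness", left perplexity to "suffix-likeness" and length to "stem-likeness", the non-morpheme probability is the product of the three complements, and the remaining mass is shared in proportion to the likeness values. The function names and all parameter values are hypothetical.

```python
import math


def likeness(value, slope, threshold):
    """Sigmoid squashing a feature value into (0, 1); an assumed functional form."""
    return 1.0 / (1.0 + math.exp(-slope * (value - threshold)))


def category_priors(right_perplexity, left_perplexity, length,
                    perplexity_threshold=10.0, length_threshold=3.0, slope=1.0):
    """Return P(category | morph) for PRE, STM, SUF and NON (non-morpheme).

    Assumed asymmetries: prefixes are followed by many different morphs
    (high right perplexity), suffixes are preceded by many different morphs
    (high left perplexity), and stems tend to be long.
    """
    prefix_like = likeness(right_perplexity, slope, perplexity_threshold)
    suffix_like = likeness(left_perplexity, slope, perplexity_threshold)
    stem_like = likeness(length, slope, length_threshold)

    # Non-morpheme: none of the proper classes is likely.
    non = (1.0 - prefix_like) * (1.0 - suffix_like) * (1.0 - stem_like)

    # Distribute the remaining probability mass proportionally.
    remaining = 1.0 - non
    total = prefix_like + stem_like + suffix_like
    return {
        "PRE": remaining * prefix_like / total,
        "STM": remaining * stem_like / total,
        "SUF": remaining * suffix_like / total,
        "NON": non,
    }


# Hypothetical feature values for a prefix-like morph and for a short, rare fragment.
prefix_example = category_priors(right_perplexity=40.0, left_perplexity=2.0, length=4)
fragment_example = category_priors(right_perplexity=1.0, left_perplexity=1.0, length=1)
print(prefix_example)
print(fragment_example)
```

With these made-up values the first morph gets most of its mass on PRE, while the short fragment with low perplexity on both sides ends up mostly in the non-morpheme category.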

Maximum a posteriori optimization

Morfessor Categories-MAP:

Older maximum-likelihood version: Categories-ML (lexicon controlled heuristically)

(Figure: the lexicon table with the entries over, simpl, s, and the HMM example # over simpl ific ation s # with P(STM | PRE), P(SUF | SUF), P(’s’ | SUF), P(’over’ | PRE))

Balance accuracy of representation of data against size of lexicon
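The slide's own formula is not preserved in the transcript. As a reminder of what a maximum a posteriori criterion looks like in general, the following is the textbook form obtained from Bayes' rule; the symbol L for the lexicon is notation introduced here, not taken from the slide.

```latex
\begin{align*}
\hat{L} &= \arg\max_{L} \; P(L \mid \text{corpus}) \\
        &= \arg\max_{L} \; \frac{P(\text{corpus} \mid L)\, P(L)}{P(\text{corpus})} \\
        &= \arg\max_{L} \; P(\text{corpus} \mid L)\, P(L)
\end{align*}
```

The factor P(corpus | L) rewards an accurate representation of the data, and the prior P(L) favours a small lexicon, which is the balance named on the slide. A maximum-likelihood criterion keeps only the first factor, so the size of the lexicon has to be controlled by other means.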

Over- and undersegmentation still a problem?

Probability of adding an entry to the lexicon:
→ Rare strings are split into smaller parts (e.g., morgan + a)

Probability of sequences in the corpus: # hands # vs. # hand s #
→ Frequent strings are left unsplit and their inner structure is “lost” (e.g., hands)

Solution: Hierarchical structures in lexicon

(Figure: tree for oppositiokansanedustaja = oppositio + kansanedustaja, with the submorphs kansa, n and edustaja below it. Legend: Non-morpheme, Stem, Suffix.)

Make morphs consist of submorphs.
Expand the tree when performing morpheme segmentation.
Do not expand morphs consisting of non-morphemes.
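The expansion rule above can be sketched as a small recursive function. This is an illustrative sketch, not the Morfessor code: the `Morph` class, the `expand` function, the category labels attached to the example morphs and the split of "present" into two non-morphemes are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Morph:
    """A lexicon entry that is either a leaf or made of two submorphs."""
    string: str
    category: str  # "PRE", "STM", "SUF" or "NON" (non-morpheme)
    children: Optional[Tuple["Morph", "Morph"]] = None


def expand(morph: Morph) -> List[str]:
    """Expand a morph into its finest segmentation that avoids non-morphemes."""
    if morph.children is None:
        return [morph.string]
    # Do not expand morphs consisting of non-morphemes.
    if any(child.category == "NON" for child in morph.children):
        return [morph.string]
    left, right = morph.children
    return expand(left) + expand(right)


# [ [ un [ expect ed ] ] ly ], with assumed category labels.
expected = Morph("expected", "STM",
                 (Morph("expect", "STM"), Morph("ed", "SUF")))
unexpected = Morph("unexpected", "STM", (Morph("un", "PRE"), expected))
unexpectedly = Morph("unexpectedly", "STM", (unexpected, Morph("ly", "SUF")))

# Hypothetical entry whose submorphs are non-morphemes: it stays whole.
present = Morph("present", "STM", (Morph("pre", "NON"), Morph("sent", "NON")))

print(expand(unexpectedly))
print(expand(present))
```

Storing the frequent string as one lexicon entry while keeping its substructure is what lets a segmentation recover the inner morphs on demand.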

Evaluation using Hutmegs (Helsinki University of Technology Morphological Evaluation Gold Standard)

Evaluate the segmentation of Morfessor against a linguistic morpheme segmentation = Hutmegs

Covers
– 1.4 million Finnish word forms
– 120 000 English word forms

Publicly available and described in the technical report:
M. Creutz and K. Lindén. 2004. Morpheme Segmentation Gold Standards for Finnish and English. Publications in Computer and Information Science, Report A77, Helsinki University of Technology.

Evaluation against the Hutmegs Gold Standard

(Figure: results for Finnish and English. Methods compared: Context-insensitive (Baseline), Paradigms (Linguistica), Heuristic (Categories-ML), Categories-MAP.)
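The transcript does not say which measure the figure reports. One common way to compare a proposed segmentation with a gold-standard segmentation is precision, recall and F-measure over morph boundary positions; the Python sketch below assumes that measure, and its function names and the example word are chosen here for illustration only.

```python
def boundaries(segmentation):
    """Character offsets of the morph boundaries inside one word."""
    positions = set()
    offset = 0
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_scores(proposed, gold):
    """Precision, recall and F-measure of proposed boundaries against gold ones."""
    proposed_b = boundaries(proposed)
    gold_b = boundaries(gold)
    hits = len(proposed_b & gold_b)
    # With no boundaries on one side, the corresponding score is set to 1.0.
    precision = hits / len(proposed_b) if proposed_b else 1.0
    recall = hits / len(gold_b) if gold_b else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# The proposed segmentation misses one of the two gold boundaries.
precision, recall, f_measure = boundary_scores(
    proposed=["unbeliev", "able"],
    gold=["un", "believ", "able"],
)
print(precision, recall, f_measure)
```

In this example the single proposed boundary is correct, so precision is 1.0, but only one of the two gold boundaries is found, so recall is 0.5. Over a whole word list the counts would be accumulated across words before dividing.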

Example segmentations

Finnish:
  [ aarre kammio ] issa
  [ aarre kammio ] on
  bahama laiset
  bahama [ saari en ]
  [ epä [ [ tasa paino ] inen ] ]
  maclare n
  [ nais [ autoili ja ] ] a
  [ sano ttiin ] ko
  töhri ( mis istä )
  [ [ voi mme ] ko ]

English:
  [ accomplish es ]
  [ accomplish ment ]
  [ beautiful ly ]
  [ insur ed ]
  [ insure s ]
  [ insur ing ]
  [ [ [ photo graph ] er ] s ]
  [ present ly ]
  found
  [ re siding ]
  [ [ un [ expect ed ] ] ly ]

Discussion

Possibility to extend the model
– rudimentary features used for “meaning”
– more fine-grained categories
– beyond concatenative phenomena (e.g., goose – geese)
– allomorphy (e.g., beauty, beauty + ’s, beauti + es, beauti + ful)

Already useful in applications
– automatic speech recognition (Finnish, Turkish)

Morpho project page
http://www.cis.hut.fi/projects/morpho/

Demo 6
http://www.cis.hut.fi/projects/morpho/

Demo 7
