A Dictionary- and Corpus-Independent Statistical Lemmatizer for IR in Low Resource Languages Aki Loponen Kalervo Järvelin Department of Information Studies.

A Dictionary- and Corpus-Independent Statistical Lemmatizer for IR in Low Resource Languages. Aki Loponen, Kalervo Järvelin. Department of Information Studies and Interactive Media, University of Tampere, Finland

The goal of our work: To create a lemmatizer for low-resource languages, specifically for IR. It should be effective, fast to set up, and on par with gold standards in well-established languages.

Problem domain: Morphological normalization is essential in morphologically complex languages and is also a factor in less complex languages. Word inflection causes problems: monolingual query-index mismatches and cross-lingual translation mismatches.

Problem domain: Lemmatization over stemming. Less ambiguity in text-based IR; accurate token translation in CLIR.

Lemmatization: Several approaches, e.g. dictionary-based methods (internal dictionaries -> need for updates, OOV); corpus analysis methods (closed corpus -> must be trained for other corpora); pure rule-based methods (probabilistic method -> precision loss).

Lemmatization problems: Out-of-vocabulary words (names, new words, loan words, etc.). Dictionary-based methods won't work; probabilistic methods aren't necessarily precise.

Lemmatizer problems: Linguistically good lemmatizers can be heavy, can be expensive, and can produce more data than necessary.

Simplify: We only need effectiveness in IR. Why use methods that do more than what we need them to? Why try to handle inflectional cases that have minimal effect in IR?

Experimental method: StaLe. StaLe is a statistical, rule-based lemmatizer that also processes OOV words. It works in two phases: a one-time creation of the transformation rules for a given language, followed by repeated lemma generation for input words. The training data set consisted of nouns only.

StaLe Principle
Learning corpus (nouns only): Häuser -> Haus; Lehrerinnen -> Lehrer; Menschens -> Mensch; Säulen -> Säule
Rules learned: häuser -> haus # cf; rinnen -> r # cf; hens -> h # cf; en -> e # cf
Legend: # = count, cf = confidence factor
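The rule shapes on this slide can be reproduced with a small sketch. It assumes, from the four examples only, that a rule is what remains after stripping the common prefix of the inflected form and the lemma, keeping one shared context character. The function names `derive_rule` and `learn_rules`, the `context` parameter, and in particular the confidence-factor formula are illustrative assumptions; the slides do not define how cf is computed.

```python
from collections import Counter


def derive_rule(inflected, lemma, context=1):
    """Return (suffix, replacement) for one training pair.

    Assumption: the shared prefix is stripped, except for `context`
    trailing characters that are kept as left context.
    """
    inflected, lemma = inflected.lower(), lemma.lower()
    i = 0
    while i < min(len(inflected), len(lemma)) and inflected[i] == lemma[i]:
        i += 1
    start = max(i - context, 0)
    return inflected[start:], lemma[start:]


def learn_rules(pairs, context=1):
    """Map each rule to (count, cf).

    Assumption (not from the slides): cf is the rule's count divided by
    the number of training words whose inflected form ends with the
    rule's left-hand side.
    """
    pairs = [(i.lower(), l.lower()) for i, l in pairs]
    counts = Counter(derive_rule(i, l, context) for i, l in pairs)
    rules = {}
    for (suffix, replacement), count in counts.items():
        matching = sum(1 for i, _ in pairs if i.endswith(suffix))
        rules[(suffix, replacement)] = (count, count / matching)
    return rules


# Learning corpus from the slide (nouns only).
corpus = [
    ("Häuser", "Haus"),
    ("Lehrerinnen", "Lehrer"),
    ("Menschens", "Mensch"),
    ("Säulen", "Säule"),
]
rules = learn_rules(corpus)
for (suffix, replacement), (count, cf) in sorted(rules.items()):
    print(f"{suffix} -> {replacement}  count={count}  cf={cf:.2f}")
```

With this toy corpus every rule has a count of 1. Under the assumed formula, `en -> e` gets a lower confidence than the others, because `Lehrerinnen` also ends in `en` but is lemmatized by a different rule.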

Simple and Quick: No internal dictionaries to set up. Inflection rules from common vocabulary.

Simple and Flexible: Any language with inflection/derivation through affixes. Knows how to lemmatize, but does not know the vocabulary.

Simple and Dirty: Probabilistic lemmatization. Lemmatization recall over lemmatization precision. "Pseudo-lemmatization".
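A minimal sketch of what "recall over precision" could look like at lemma-generation time: every rule whose left-hand side matches the end of the input word produces a candidate, and several candidates are kept rather than one. The function `generate_lemmas`, the parameters `min_cf` and `max_candidates`, the fallback to the unchanged word, and the rule values in `RULES` are all hypothetical; they are not taken from the slides.

```python
# Hypothetical rule table: (suffix, replacement) -> (count, cf).
# The rule shapes follow the StaLe Principle slide; the numbers are made up.
RULES = {
    ("häuser", "haus"): (1, 1.0),
    ("rinnen", "r"): (1, 1.0),
    ("hens", "h"): (1, 1.0),
    ("en", "e"): (1, 0.5),
}


def generate_lemmas(word, rules, min_cf=0.0, max_candidates=3):
    """Return candidate lemmas for `word`, highest confidence first.

    No dictionary lookup is involved, so out-of-vocabulary words are
    handled the same way as known ones.
    """
    word = word.lower()
    candidates = {}
    for (suffix, replacement), (_count, cf) in rules.items():
        if cf >= min_cf and word.endswith(suffix):
            lemma = word[: len(word) - len(suffix)] + replacement
            candidates[lemma] = max(cf, candidates.get(lemma, 0.0))
    ranked = sorted(candidates, key=lambda c: (-candidates[c], c))
    # Assumption: if no rule applies, the word is returned unchanged.
    return ranked[:max_candidates] or [word]


print(generate_lemmas("Fahrerinnen", RULES))  # word not in the learning corpus
print(generate_lemmas("Haus", RULES))         # no rule applies
```

For `Fahrerinnen` the sketch returns both `fahrer` and the non-word `fahrerinne`. Keeping the second candidate costs lemmatization precision, but for IR the extra index or query term is usually harmless as long as the correct lemma is among the candidates.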

Simple and Strong: On par with established methods in high-resource languages.
