A Dictionary- and Corpus-Independent Statistical Lemmatizer for IR in Low Resource Languages
Aki Loponen, Kalervo Järvelin
Department of Information Studies and Interactive Media, University of Tampere, Finland
The goal of our work
- To create a lemmatizer for low-resource languages
- Specifically for IR
- Effective
- Fast to set up
- On par with gold standards in well-established languages
Problem domain
- Morphological normalization is essential
  - In morphologically complex languages
  - Also a factor in less complex languages
- Word inflection causes problems
  - Monolingual query-index mismatches
  - Cross-lingual translation mismatches
Problem domain
- Lemmatization over stemming
  - Less ambiguity in text-based IR
  - Accurate token translation in CLIR
Lemmatization
- Several approaches, e.g.
  - Dictionary-based methods: internal dictionaries -> need for updates, OOV problems
  - Corpus analysis methods: closed corpus -> must be trained for other corpora
  - Pure rule-based methods: probabilistic method -> precision loss
Lemmatization problems
- Out-of-vocabulary words (names, new words, loan words, etc.)
  - Dictionary-based methods won't work
  - Probabilistic methods aren't necessarily precise
Lemmatizer problems
- Linguistically good lemmatizers
  - Can be heavy
  - Can be expensive
  - Can produce more data than necessary
Simplify
- We only need effectiveness in IR
- Why use methods that do more than we need them to?
- Why try to handle inflectional cases that have minimal effect in IR?
Experimental method: StaLe
- StaLe is a statistical, rule-based lemmatizer that also handles OOV words
- Two phases:
  - One-time creation of the transformation rules for a given language
  - Repeated lemma generation for input words
- The training data set consisted of nouns only
StaLe Principle
- Learning corpus (nouns only):
  - Häuser -> Haus
  - Lehrerinnen -> Lehrer
  - Menschens -> Mensch
  - Säulen -> Säule
- Rules learned, each with a count (#) and a confidence factor (cf):
  - häuser -> haus
  - rinnen -> r
  - hens -> h
  - en -> e
- Rule derivation is sketched in the code below
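To make the rule-creation phase concrete, here is a minimal Python sketch that derives suffix rules and confidence factors from (inflected form, lemma) pairs like the ones above. The one-character-of-context heuristic and the confidence-factor formula are assumptions inferred from the examples on this slide, not necessarily StaLe's exact algorithm; the function name learn_rules is hypothetical.

```python
from collections import Counter, defaultdict

def learn_rules(word_pairs):
    """Learn suffix-transformation rules from (inflected, lemma) pairs.

    Each rule maps the inflected suffix (starting one character before
    the first point of divergence) to the corresponding lemma suffix,
    e.g. Lehrerinnen -> Lehrer yields the rule 'rinnen' -> 'r'.
    """
    rule_counts = Counter()
    source_counts = Counter()
    for inflected, lemma in word_pairs:
        inflected, lemma = inflected.lower(), lemma.lower()
        # Find the first index where the two forms diverge.
        i = 0
        while i < min(len(inflected), len(lemma)) and inflected[i] == lemma[i]:
            i += 1
        start = max(i - 1, 0)              # keep one character of context
        source, target = inflected[start:], lemma[start:]
        rule_counts[(source, target)] += 1
        source_counts[source] += 1

    # Confidence factor (assumption): share of training words with this
    # source suffix for which this particular rewrite was the correct one.
    rules = defaultdict(list)
    for (source, target), count in rule_counts.items():
        cf = count / source_counts[source]
        rules[source].append((target, count, cf))
    return rules

corpus = [("Häuser", "Haus"), ("Lehrerinnen", "Lehrer"),
          ("Menschens", "Mensch"), ("Säulen", "Säule")]
rules = learn_rules(corpus)
# e.g. rules["rinnen"] == [("r", 1, 1.0)]
```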
Simple and Quick
- No internal dictionaries to set up
- Inflection rules from common vocabulary
Simple and Flexible
- Any language with inflection/derivation through affixes
- Knows how to lemmatize, but does not know the vocabulary
Simple and Dirty
- Probabilistic lemmatization
- Lemmatization recall over lemmatization precision
- "Pseudo-lemmatization"
- A generation sketch follows below
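A sketch of the lemma-generation phase, reusing the rules learned in the earlier sketch: every matching suffix rule above a confidence threshold is applied, so a word may receive several candidate lemmas, which is the recall-over-precision behaviour the slide calls "pseudo-lemmatization". The function name generate_lemmas and the threshold value are illustrative assumptions, not taken from the paper.

```python
def generate_lemmas(word, rules, cf_threshold=0.1):
    """Generate candidate lemmas by applying every matching suffix rule
    whose confidence factor exceeds the threshold.

    Several candidates may be returned; the threshold is illustrative.
    """
    word = word.lower()
    candidates = []
    # Try every suffix of the word as a potential rule source.
    for start in range(len(word)):
        for target, count, cf in rules.get(word[start:], []):
            if cf >= cf_threshold:
                candidates.append((word[:start] + target, cf))
    if not candidates:                 # no rule applies: keep the word as-is
        candidates.append((word, 1.0))
    # Highest-confidence candidates first.
    return sorted(set(candidates), key=lambda c: -c[1])

# Uses the `rules` learned in the previous sketch; proposes both
# 'bäcker' and 'bäckerinne' (recall over precision).
print(generate_lemmas("Bäckerinnen", rules))
```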
Simple and Strong
- On par with established methods in high-resource languages