A Dictionary- and Corpus-Independent Statistical Lemmatizer for IR in Low Resource Languages Aki Loponen Kalervo Järvelin Department of Information Studies.


1 A Dictionary- and Corpus-Independent Statistical Lemmatizer for IR in Low Resource Languages. Aki Loponen, Kalervo Järvelin. Department of Information Studies and Interactive Media, University of Tampere, Finland

2 The goal of our work: To create a lemmatizer for low-resource languages. Specifically for IR. Effective. Fast setup. On par with gold standards in well-established languages.

3 Problem domain: Morphological normalization is essential (morphologically complex languages; also a factor in less complex languages). Word inflection causes problems (monolingual query-index mismatches; cross-lingual translation mismatches).

4 Problem domain: Lemmatization over stemming. Less ambiguity in text-based IR. Accurate token translation in CLIR.

5 Lemmatization: Several approaches, e.g. Dictionary-based methods: internal dictionaries -> need for updates, OOV. Corpus analysis methods: closed corpus -> must be trained for other corpora. Pure rule-based methods: probabilistic method -> precision loss.

6 Lemmatization problems: Out-of-vocabulary words (names, new words, loan words, etc.). Dictionary-based methods won’t work. Probabilistic methods aren’t necessarily precise.

7 Lemmatizer problems: Linguistically good lemmatizers can be heavy, can be expensive, and can produce more data than necessary.

8 Simplify: We only need effectiveness in IR. Why use methods that do more than we need them to? Why try to handle inflectional cases that have minimal effect in IR?

9 Experimental method: StaLe. StaLe is a statistical, rule-based lemmatizer that also handles OOV words. Two phases: (1) one-time creation of the transformation rules for a given language; (2) repeated lemma generation for input words. The training data set consisted of nouns only.

10 StaLe Principle: Learning corpus (nouns only): Häuser -> Haus; Lehrerinnen -> Lehrer; Menschens -> Mensch; Säulen -> Säule. Rules learned: häuser -> haus (#, cf); rinnen -> r (#, cf); hens -> h (#, cf); en -> e (#, cf). Legend: # = count; cf = confidence factor.
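
The four example rules on this slide are consistent with a simple procedure: lowercase both forms, strip their longest common prefix, and keep one character of left context. The Python sketch below reproduces the slide's rules under that assumption. It is inferred from the examples only and is not the authors' implementation; the names `extract_rule` and `learn_rules` and the `context` parameter are hypothetical. Only the count (#) is modelled, because the slides do not say how the confidence factor is computed.

```python
from collections import Counter


def extract_rule(inflected, lemma, context=1):
    """Return a (source_suffix, target_suffix) rule for one training pair.

    Assumption: a rule is the part of each form after their longest common
    prefix, extended to the left by `context` characters.
    """
    a, b = inflected.lower(), lemma.lower()
    i = 0
    # Advance i to the length of the longest common prefix.
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    start = max(i - context, 0)
    return a[start:], b[start:]


def learn_rules(pairs):
    """Count how often each rule is observed in the training pairs."""
    return Counter(extract_rule(word, lemma) for word, lemma in pairs)


# The learning corpus shown on the slide (nouns only).
PAIRS = [
    ("Häuser", "Haus"),
    ("Lehrerinnen", "Lehrer"),
    ("Menschens", "Mensch"),
    ("Säulen", "Säule"),
]

RULES = learn_rules(PAIRS)
for (source, target), count in RULES.items():
    print(f"{source} -> {target}  #={count}")
```

With a realistic training list, frequent rules such as `en -> e` would accumulate high counts, while whole-word rules such as `häuser -> haus` would stay rare.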

11 Simple and Quick: No internal dictionaries to set up. Inflection rules from common vocabulary.

12 Simple and Flexible: Any language with inflection/derivation through affixes. Knows how to lemmatize, but does not know the vocabulary.

13 Simple and Dirty: Probabilistic lemmatization. Lemmatization recall over lemmatization precision. "Pseudo-lemmatization".
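
To illustrate "recall over precision", the sketch below applies every suffix rule that matches an input word and returns all resulting candidates instead of committing to one. It is a minimal illustration and not the StaLe implementation. The rule table `EXAMPLE_RULES`, its counts, the `min_count` threshold, and the fallback of returning the unchanged word are all hypothetical assumptions. The slides mention a confidence factor but do not define it, so ranking here uses the raw count only.

```python
# Hypothetical rule table: (source_suffix, target_suffix) -> observed count.
EXAMPLE_RULES = {
    ("en", "e"): 12,
    ("en", ""): 7,
    ("rinnen", "r"): 3,
}


def candidates(word, rules, min_count=1):
    """Return all candidate lemmas for `word`, best-supported first."""
    w = word.lower()
    best = {}
    for (source, target), count in rules.items():
        if count >= min_count and w.endswith(source):
            lemma = w[: len(w) - len(source)] + target
            # Keep the highest count if several rules yield the same lemma.
            best[lemma] = max(best.get(lemma, 0), count)
    ranked = sorted(best, key=lambda lemma: (-best[lemma], lemma))
    # Assumption: with no matching rule, fall back to the word itself.
    return ranked or [w]


print(candidates("Blumen", EXAMPLE_RULES))
print(candidates("Haus", EXAMPLE_RULES))
```

Indexing or querying with all candidates accepts some wrong lemmas (here `blum` alongside `blume`) in exchange for a higher chance that the correct one is present, which is the trade-off the slide names.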

14 Simple and Strong: On par with established methods in high-resource languages.

