1
K.U. Leuven, Leuven 2008-05-08
Morphological Normalization and Collocation Extraction
Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb
Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
jan.snajder @ fer.hr, bojana.dalbelo @ fer.hr, marko.tadic @ ffzg.hr
Seminar at the K.U. Leuven, Department of Computing Science
2
Morphological Normalization
Jan Šnajder, Marko Tadić
3
Talk overview
- who are we?
- what are we doing?
- morphological processing: normalization
  - lemmatization vs. stemming
  - Mollex: a system for normalization of Croatian
  - usage in document indexing and text classification
- collocations as features
  - collocation extraction by co-occurrence measures
  - usage of genetic programming
4
Who are we?
- University of Zagreb, Croatia
  - founded 1669, 52,500 undergraduate students
- two faculties with the same mission
  - build the systems that will develop and enable the usage of language resources and tools for Croatian
5
Who are we? 2
- Faculty of Humanities and Social Sciences
  - Institute / Department of Linguistics
- dealing with basic computational linguistic tasks for Croatian
- compiling and processing large-scale language resources
  - Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
  - tagger, lemmatizer
  - chunker, parser
  - NERC system
6
Who are we? 3
- Faculty of Electrical Engineering and Computing
  - Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab (Knowledge Technologies Laboratory)
- the group deals with
  - text preprocessing techniques for Croatian for machine learning procedures
  - dimensionality reduction and document clustering in the vector space model + visualisation
  - automatic indexing of documents
  - intelligent, language-specific information retrieval and extraction
7
What are we doing?
- working jointly on several research projects
  - AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
  - RMJT: Computational Linguistic Models and Language Technologies for Croatian (national research programme, two of five projects)
    - Croatian language resources and their annotation, 2007-2011, prof. Marko Tadić
    - Knowledge discovery in textual data, 2007-2011, prof. Bojana Dalbelo Bašić
  - CADIAL: Computer Aided Document Indexing for Accessing Legislation
    - joint Flemish-Croatian project, 2007-2009, prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić
8
Morphological processing
- computational linguistic / NLP task
- important for inflectionally rich languages, e.g. Croatian
  - noun in 14 word-forms (7 cases, 2 numbers):
      N: student    studenti
      G: studenta   studenata
      D: studentu   studentima
      A: studenta   studente
      V: studentu   studenti
      L: studentu   studentima
      I: studentom  studentima
- unlike English
  - noun in 2 (3?) word-forms (2 numbers + possessive?):
      Sg: student   Poss: (student's)
      Pl: students
- present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...
9
Morphological processing 2
- three basic subtasks in inflection processing
  1. generation of (all) word-forms (WFs) of a lexeme
  2. analysis of WFs, i.e. recognizing the values of the morphosyntactic categories of a WF in text
  3. recognizing which lexeme(s) a WF belongs to
- the last one helps us avoid the problem of data sparseness in many text processing tasks, e.g. information retrieval, text mining, document indexing
- normalization: conflating the morphological variants of a word to a single representative form
- two main ways to do that
  1. linguistically motivated: lemmatization
  2. computationally motivated: stemming
10
Morphological processing 3
- lemmatization
  - replacing the WF with its proper base WF, usually called the lemma
  - e.g. mapping the theoretical maximum of (e.g. 14) WFs to 1 lemma
- lexicon-based
  - large lexicons of all (generated) WFs needed
  - preparation expensive in time and manpower
  - mostly realized by databases
- algorithm-based
  - mostly FSTs (finite-state transducers): compact, efficient, fast
  - lexicon of lemmas and their inflectional patterns needed anyway
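The lexicon-based variant can be pictured as a plain lookup table. The following is a minimal sketch, assuming a toy in-memory dictionary instead of a database; the names `LEXICON` and `lemmatize` are illustrative and not part of the authors' system, and the word-forms are the distinct forms of the noun "student" listed earlier in the talk.

```python
# Toy word-form lexicon (an assumption for illustration): every distinct
# inflected form of the noun "student" maps to the same lemma.
STUDENT_FORMS = [
    "student", "studenta", "studentu", "studentom",      # singular forms
    "studenti", "studenata", "studentima", "studente",   # plural forms
]
LEXICON = {word_form: "student" for word_form in STUDENT_FORMS}


def lemmatize(word_form):
    """Return the lemma of a known word-form, or the form itself if unknown."""
    return LEXICON.get(word_form.lower(), word_form)


print(lemmatize("studentima"))
print(lemmatize("brzina"))  # not in the toy lexicon, returned unchanged
```

The fallback for unknown forms shows the coverage problem of lexicon-based lemmatization: whatever is not listed is simply not normalized.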
11
Morphological processing 4
- stemming
  - reducing the WF from the end by truncating the possible endings
  - does not have to respect the linguistic boundaries
      vuk+Ø > *vu+kØ
      vuk+a > *vu+ka
      vuč+e > *vu+če
  - reducing all the WFs to a common beginning
  - problems where there are many morphonological adaptations
      sla+ti   > *?+slati
      šalj+em  > *?+šaljem
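The "common beginning" idea, and why it breaks down, can be illustrated with a longest-common-prefix computation. This is only a sketch of the effect shown in the examples above, not an actual stemmer; the function name `common_prefix` is an assumption made for illustration.

```python
def common_prefix(word_forms):
    """Return the longest string that every word-form starts with."""
    prefix = word_forms[0]
    for word_form in word_forms[1:]:
        # Shorten the candidate until the current word-form starts with it.
        while not word_form.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


# The k/č alternation pushes the common beginning below the linguistic stem.
print(repr(common_prefix(["vuk", "vuka", "vuče"])))
# The forms of "slati" share no beginning at all.
print(repr(common_prefix(["slati", "šaljem"])))
```

For the first group the shared beginning is shorter than the stem "vuk", and for the second group nothing is shared, which is the situation marked with "?" above.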
12
Morphological normalization
- the Croatian language (like most Slavic languages) is morphologically complex
  - elaborate inflectional and derivational morphology
  - problematic for most NLP applications
  - requires the use of substantial linguistic knowledge
- our lexicon-based approach to normalization
  - is somewhere in between lemmatization and stemming
  - is suitable for other inflectionally complex languages
13
Croatian Morphology
1. high degree of affixation
  - word-forms are obtained by suffixation, prefixation, phonological alternations, stem extension
  - inflection
    - nouns: declension (7 cases, 2 numbers)
    - verbs: conjugation (tenses, persons, numbers, genders)
    - adjectives: declension (7 cases, 2 numbers, 3 genders), comparison (3 degrees), and definiteness
  - derivation
    - a large number of rules for deriving nouns from verbs, verbs from nouns, possessive adjectives, ...
14
Croatian Morphology 2
- inflection examples
  - adjective: brz, brza, brzi, brzima, brzih, brzoj, brze, brzim, brzog, brzoga, brz, brza, brzo, brzom, brzomu, brži, bržeg, brža, brži, bržima, bržih, bržoj, brže, bržim, bržem, bržima, najbrži, bržeg, najbrža, najbržima, najbržih, najbrže, najbržim, najbrži, najbržoj, ...
  - noun: brzina, brzinom, brzine, brzinama, brzinu, brzina, brzini
  - adjective: brzinski, brzinskom, brzinske, brzinskih, brzinska, brzinskoj, brzinsko, brzinskog, brzinskoga, …
  - adverb: brzo, brže, najbrže, brzinski
- derivation examples
  - brz > brzina > brzinski > …
15
Croatian Morphology 3
2. high degree of homography
  - vode = voda (water) | voditi (to lead) | vod (a platoon)
  - requires disambiguation (POS/MSD tagging; MSD = morphosyntactic description)
3. affix ambiguity
  - many ambiguous suffixation rules
    - e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i
    - e.g. bolnic-a / bolnic-om vs. brodolom / brodolom-a
  - possible mismatches at inflectional level
    - narančast / narančast-om vs. ruž / ruž-om (not ruža)
  - possible mismatches at derivational level
    - e.g. kralj / kralj-ica vs. stan / stan-ica
16
Lexicon-based normalization
- lexicon-based morphological normalisation
  - a morphological lexicon associates each WF with its morphological norm (lemma, stem, ...) and, optionally, an MSD
  - incorporates linguistic knowledge and thus avoids the aforementioned pitfalls
- drawbacks
  - made by linguists, expensive and time-consuming
  - problems with coverage (neologisms, jargon, …)
- our approach
  - rule-based acquisition of large-coverage morphological lexica from raw (unannotated) corpora
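A morphological lexicon of this kind can be sketched as a mapping from each word-form to one or more (norm, tag) entries. The sketch below is an assumption-laden illustration: the coarse tags "noun" and "verb" stand in for real MSD tags, and `LEXICON` and `normalize` are hypothetical names. It uses the homograph "vode" from the earlier slide to show why tagging is needed for disambiguation.

```python
# Hypothetical lexicon entries: word-form -> list of (lemma, coarse tag).
# Real MSD tags are much richer; "noun"/"verb" are placeholders.
LEXICON = {
    "vode": [("voda", "noun"), ("voditi", "verb"), ("vod", "noun")],
    "vojnika": [("vojnik", "noun")],
}


def normalize(word_form, tag=None):
    """Return all candidate lemmas, optionally filtered by a known tag."""
    entries = LEXICON.get(word_form, [])
    return sorted({lemma for lemma, t in entries if tag is None or t == tag})


print(normalize("vode"))           # ambiguous without context
print(normalize("vode", "verb"))   # a tagger's decision narrows the choice
```

Without a tag the homograph yields three candidate lemmas; even with the coarse "noun" tag two candidates remain, which is why finer-grained MSD information helps.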
17
Our approach
1. acquisition of inflectional lexicon
  - input: raw corpora and sets of inflectional and derivational rules in a convenient (grammar-book-like) formalism
2. normalisation of word-forms
  - inflectional (lemmatization)
  - inflectional + derivational: comparable to stemming (but more precise)
- advantages
  - can be used as both a lemmatizer (with MSD) and a stemmer (with variable degree of conflation)
  - provides good lexicon coverage
  - requires only limited linguistic expertise
18
Morphology representation
e.g. noun inflectional paradigm vojnik (soldier)

  Case  Singular   Plural
  N     vojnik-Ø   vojnic-i
  G     vojnik-a   vojnik-a
  D     vojnik-u   vojnic-ima
  A     vojnik-a   vojnik-e
  V     vojnič-e   vojnic-i
  L     vojnik-u   vojnic-ima
  I     vojnik-om  vojnic-ima
19
Morphology representation 2
- defines inflectional and derivational rules
- uses functions as building blocks:
  A) condition functions
  B) string transformation functions
- each defined using a higher-order function
  - e.g. sfx
      sfx('a')
      sfx('a')('vojnik') = 'vojnika'
      sfx('e') ∘ alt(pal)
      (sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
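The two examples on this slide can be reproduced with ordinary closures. The authors' formalism is implemented in Haskell; what follows is only a Python sketch under stated assumptions: the alternation table `PAL` (k→č, g→ž, h→š) is a partial palatalization table written for illustration, `alt` is assumed to rewrite just the final character of the stem, and `compose` is a helper introduced here for function composition.

```python
def sfx(suffix):
    """Higher-order function: returns a transformation that appends `suffix`."""
    return lambda stem: stem + suffix


# Assumed, partial palatalization table (k -> č, g -> ž, h -> š).
PAL = {"k": "č", "g": "ž", "h": "š"}


def alt(table):
    """Returns a transformation that alternates the stem-final character."""
    def transform(stem):
        last = stem[-1:]
        return stem[:-1] + table.get(last, last)
    return transform


def compose(f, g):
    """Function composition: compose(f, g)(x) == f(g(x))."""
    return lambda x: f(g(x))


print(sfx("a")("vojnik"))
print(compose(sfx("e"), alt(PAL))("vojnik"))
```

Because every transformation is a plain function from string to string, new building blocks can be added without touching the existing ones.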
20
Morphology representation 3

  Case  Singular   Plural
  N     vojnik-Ø   vojnic-i
  G     vojnik-a   vojnik-a
  D     vojnik-u   vojnic-ima
  A     vojnik-a   vojnik-e
  V     vojnič-e   vojnic-i
  L     vojnik-u   vojnic-ima
  I     vojnik-om  vojnic-ima

  ( s.ends('k','g','h')(s) consGroup(s),
    {null, sfx('a'), sfx('u'), sfx('om'), sfx('e') ∘ alt(pal), sfx('i') ∘ alt(sib), sfx('ima') ∘ alt(sib), sfx('e')} )
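A paradigm written as a (condition, set of transformations) pair can be applied to a stem to generate its word-forms. The sketch below is in Python rather than the authors' Haskell and rests on several assumptions: the condition keeps only the `ends('k','g','h')` part (the `consGroup` test is left out because its definition is not given in the slides), the tables `PAL` and `SIB` are partial alternation tables written for illustration, and `generate` is a hypothetical helper.

```python
PAL = {"k": "č", "g": "ž", "h": "š"}  # palatalization, assumed partial table
SIB = {"k": "c", "g": "z", "h": "s"}  # sibilarization, assumed partial table


def sfx(suffix):
    return lambda s: s + suffix


def alt(table):
    # Alternate only the stem-final character (an assumption of this sketch).
    return lambda s: s[:-1] + table.get(s[-1:], s[-1:])


def compose(f, g):
    return lambda s: f(g(s))


def ends(*letters):
    """Condition function: does the stem end in one of the given letters?"""
    return lambda s: s.endswith(letters)


def null(s):
    """Identity transformation (zero ending)."""
    return s


# Simplified paradigm for nouns like "vojnik": (condition, transformations).
NOUN_K = (
    ends("k", "g", "h"),
    [null, sfx("a"), sfx("u"), sfx("om"),
     compose(sfx("e"), alt(PAL)),
     compose(sfx("i"), alt(SIB)),
     compose(sfx("ima"), alt(SIB)),
     sfx("e")],
)


def generate(paradigm, stem):
    """Return the set of word-forms, or an empty set if the condition fails."""
    condition, transformations = paradigm
    if not condition(stem):
        return set()
    return {transform(stem) for transform in transformations}


print(sorted(generate(NOUN_K, "vojnik")))
```

The eight transformations yield the eight distinct forms of the table; the 14 case/number slots collapse because several of them share a form.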
21
Morphology representation 4
suitable also for more complex paradigms

  (c,
   {null, sfx('a'), sfx('u'), ..., sfx('ima')}
   {sfx('og'), sfx('om'), ..., sfx('ima')}
   {sfx('i') ∘ alt(jot), sfx('eg') ∘ alt(jot), ..., sfx('ima') ∘ alt(jot)}
   {sfx('i') ∘ alt(jot) ∘ pfx('naj'), ..., sfx('ima') ∘ alt(jot) ∘ pfx('naj')})
22
Morphology representation 5
- advantages
  - resembles the morphology descriptions found in traditional grammar books
  - requires a minimum amount of linguistic knowledge
  - highly expressive: arbitrary higher-order functions (HOFs) can be defined
  - can be applied to other morphologically similar languages
- implemented in Haskell
  - purely functional programming language
  - requires minimum programming skills
23
Lexicon acquisition
- uses inflectional rules + raw corpora to extract lemmas and their paradigms
  - uses frequency counts of WFs attested in the corpus
  - much of the ambiguity is resolved by language-dependent heuristics
    - plausibility, priority
- linguistic quality is not vital
  - word-form conflation rather than generation
  - human intervention is not required
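The acquisition idea can be sketched as "hypothesize, generate, and check against the corpus". Everything specific below is an assumption made for illustration, not the authors' algorithm: the two suffix-only paradigms in `PARADIGMS` are drastically simplified, and the scoring (prefer an attested lemma, then more attested forms, then higher total frequency) merely stands in for the plausibility and priority heuristics, whose details the slides do not give.

```python
from collections import Counter

# Hypothetical, heavily simplified paradigms: the first suffix forms the lemma.
PARADIGMS = {
    "noun-masc": ["", "a", "u", "om", "i", "ima", "e"],
    "noun-fem": ["a", "e", "i", "u", "om", "ama"],
}


def hypotheses(word_form):
    """Yield (lemma, paradigm name, generated forms) that could explain a WF."""
    for name, suffixes in PARADIGMS.items():
        for suffix in suffixes:
            stem = word_form[: len(word_form) - len(suffix)]
            if stem and word_form.endswith(suffix):
                forms = {stem + s for s in suffixes}
                yield stem + suffixes[0], name, forms


def acquire(counts):
    """Map every corpus word-form to the lemma of its best-scoring hypothesis."""
    def score(hypothesis):
        lemma, _, forms = hypothesis
        attested = sum(1 for form in forms if form in counts)
        frequency = sum(counts[form] for form in forms)
        return (lemma in counts, attested, frequency)

    return {wf: max(hypotheses(wf), key=score)[0] for wf in counts}


corpus = "brzina brzine brzinu brzinom student studenta studentu studenti"
lexicon = acquire(Counter(corpus.split()))
print(lexicon["brzinom"], lexicon["studenti"])
```

A hypothesis such as the unattested masculine lemma "brzin" generates as many attested forms as the feminine "brzina", so some tie-breaking heuristic is unavoidable; here the attested-lemma preference plays that role.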
24
Results
- example
  - lexicon acquired from a 20 Mw (million-word) newspaper corpus
  - based on 90 inflectional and >300 derivational rules
  - contains ca. 42,000 lemmas associated with over 500,000 WFs
- performance
  - linguistic quality F1 = 88% per type
  - coverage 96% per type and 98% per token
  - understemming = 7%
  - overstemming < 4%
- can be improved further by manual editing
25
Derivational normalization
- the inflectional lexicon is partitioned into equivalence classes based on derivational rules
- degree of normalisation depends on the number of derivational rules used
- problem with semantics
  - context, degrees
  - derivation is not as semantically regular as inflection
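Partitioning lemmas into equivalence classes can be sketched with a union-find structure: two lemmas fall into the same class whenever a derivational rule maps one onto the other. The three suffix rules below are hypothetical simplifications written for this example (they are not taken from the authors' rule set), and `derivational_classes` is an illustrative name.

```python
def add_ina(lemma):      # e.g. brz > brzina
    return lemma + "ina"


def a_to_ski(lemma):     # e.g. brzina > brzinski
    return lemma[:-1] + "ski" if lemma.endswith("a") else None


def add_ica(lemma):      # e.g. kralj > kraljica (but also stan > stanica)
    return lemma + "ica"


def derivational_classes(lemmas, rules):
    """Partition lemmas into classes linked by the given derivational rules."""
    parent = {lemma: lemma for lemma in lemmas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for lemma in lemmas:
        for rule in rules:
            derived = rule(lemma)
            if derived in parent:          # only link attested lemmas
                parent[find(derived)] = find(lemma)

    classes = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), set()).add(lemma)
    return sorted(sorted(members) for members in classes.values())


LEMMAS = ["brz", "brzina", "brzinski", "kralj", "kraljica", "stan", "stanica"]
print(derivational_classes(LEMMAS, [add_ina, a_to_ski]))
print(derivational_classes(LEMMAS, [add_ina, a_to_ski, add_ica]))
```

Adding the third rule increases the degree of conflation: it joins kralj with kraljica, but it also joins stan with stanica, the pair given earlier as a derivational mismatch, which illustrates the semantic problem noted on this slide.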
26
References and applications
- Reference
  - Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing and Management, 2008. (in press)
- Applied in document indexing
  - projects AIDE & CADIAL, www.cadial.org
  - Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / J. Van Nieuwenhove & P. Popelier (eds). Brugge : Die Keure, 2008. pp. 107-117.
- Applied in text classification
  - Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information Processing and Management, 44 (2008), 1; 325-339.
27
Thank you for your attention!