K.U. Leuven Leuven 2008-05-08 Morphological Normalization and Collocation Extraction Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić University of Zagreb.

Morphological Normalization and Collocation Extraction
Jan Šnajder, Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb
Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
jan.snajder@fer.hr, bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr
Seminar at the K.U. Leuven, Department of Computing Science
Leuven, 2008-05-08

Morphological Normalization
Jan Šnajder, Marko Tadić
University of Zagreb
Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
jan.snajder@fer.hr, bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr
Seminar at the K.U. Leuven, Department of Computing Science
Leuven, 2008-05-08

Talk overview
• who are we?
• what are we doing?
• morphological processing: normalization
  - lemmatization vs. stemming
  - Mollex: a system for normalization of Croatian
  - usage in document indexing and text classification
• collocations as features
  - collocation extraction by co-occurrence measures
  - usage of genetic programming

Who are we?
• University of Zagreb, Croatia
  - founded 1669, 52,500 undergraduate students
• two faculties on the same mission
  - to build the systems that will develop language resources and tools for Croatian and enable their use

Who are we? 2
• Faculty of Humanities and Social Sciences
  - Institute / Department of Linguistics
• dealing with basic computational linguistic tasks for Croatian
• compiling and processing large-scale language resources
  - Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
  - tagger, lemmatizer
  - chunker, parser
  - NERC system

Who are we? 3
• Faculty of Electrical Engineering and Computing
  - Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
• the Knowledge Technologies Laboratory group deals with
  - text preprocessing techniques for Croatian for machine learning procedures
  - dimensionality reduction and document clustering in the vector space model + visualisation
  - automatic indexing of documents
  - intelligent, language-specific information retrieval and extraction

What are we doing?
• working jointly on several research projects
• AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA)
• RMJT: Computational Linguistic Models and Language Technologies for Croatian (national research programme, two of five projects)
  - Croatian language resources and their annotation, 2007-2011, prof. Marko Tadić
  - Knowledge discovery in textual data, 2007-2011, prof. Bojana Dalbelo Bašić
• CADIAL: Computer Aided Document Indexing for Accessing Legislation
  - joint Flemish-Croatian project
  - 2007-2009
  - prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić

Morphological processing
• computational linguistic / NLP task
• important for inflectionally rich languages, e.g.
  - a Croatian noun has 14 word-forms (7 cases, 2 numbers):

    Case  Singular   Plural
    N     student    studenti
    G     studenta   studenata
    D     studentu   studentima
    A     studenta   studente
    V     studentu   studenti
    L     studentu   studentima
    I     studentom  studentima

  - unlike an English noun, which has 2 (3?) word-forms (2 numbers + possessive?):

    Sg: student    Pl: students    Poss: (student's)

• rich inflection is present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...

Morphological processing 2
• three basic subtasks in inflection processing
  1. generation of (all) word-forms (WFs) of a lexeme
  2. analysis of WFs, i.e. recognizing the values of the morphosyntactic categories of a WF in text
  3. recognizing which lexeme(s) a WF belongs to
• the last one helps avoid the problem of data sparseness in many text processing tasks, e.g.
  - information retrieval, text mining, document indexing
• normalization: conflating the morphological variants of a word to a single representative form
• two main ways to do that
  1. linguistically motivated: lemmatization
  2. computationally motivated: stemming

Morphological processing 3
• lemmatization
  - replacing the WF with its proper base WF, usually called the lemma
  - e.g. mapping the theoretical maximum of (e.g. 14) WFs to 1 lemma
• lexicon-based
  - large lexicons of all (generated) WFs needed
  - preparation expensive in time and manpower
  - mostly realized by databases
• algorithm-based
  - mostly finite-state transducers (FSTs): compact, efficient, fast
  - a lexicon of lemmas and their inflectional patterns is needed anyway
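The lexicon-based variant of lemmatization can be pictured as a plain table lookup. The sketch below is only an illustration in Python, not the authors' system: the names `PARADIGMS`, `build_full_form_lexicon` and `lemmatize` are hypothetical, and the only data is the "student" paradigm from the earlier slide.

```python
# Toy full-form lexicon (hypothetical data): lemma -> word-forms, taken from
# the "student" paradigm shown earlier in the talk.
PARADIGMS = {
    "student": ["student", "studenta", "studentu", "studentom",
                "studenti", "studenata", "studentima", "studente"],
}


def build_full_form_lexicon(paradigms):
    """Invert lemma -> word-forms into word-form -> set of candidate lemmas."""
    lexicon = {}
    for lemma, forms in paradigms.items():
        for form in forms:
            lexicon.setdefault(form, set()).add(lemma)
    return lexicon


LEXICON = build_full_form_lexicon(PARADIGMS)


def lemmatize(word_form):
    """Return all candidate lemmas; unknown forms are returned unchanged."""
    key = word_form.lower()
    return sorted(LEXICON.get(key, {key}))


print(lemmatize("studentima"))  # ['student']
```

A set of lemmas per word-form is used because a single form may belong to more than one lexeme.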

Morphological processing 4
• stemming
  - reducing the WF from the end by truncating the possible endings
  - does not have to respect the linguistic boundaries

      vuk+Ø  > *vu+kØ
      vuk+a  > *vu+ka
      vuč+e  > *vu+če

  - reducing all the WFs to a common beginning
  - problems where there are many morphonological adaptations

      sla+ti  > *?+slati
      šalj+em > *?+šaljem
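The two problems named on this slide can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: `ENDINGS` is a tiny hypothetical ending list, and `strip_ending` and `common_beginning` are illustrative helpers, not a real stemmer for Croatian.

```python
def strip_ending(word, endings):
    """Remove the longest matching ending, keeping at least one character."""
    for ending in sorted(endings, key=len, reverse=True):
        if ending and word.endswith(ending) and len(word) > len(ending):
            return word[: -len(ending)]
    return word


def common_beginning(words):
    """Longest prefix shared by all words (possibly empty)."""
    prefix = words[0]
    for word in words[1:]:
        while not word.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


ENDINGS = ["a", "e", "ti", "em"]  # hypothetical, tiny ending list

# Linguistically sensible truncation does not conflate the forms of "vuk":
print([strip_ending(w, ENDINGS) for w in ["vuk", "vuka", "vuče"]])
# ['vuk', 'vuk', 'vuč']

# A common beginning exists only below the stem boundary ...
print(common_beginning(["vuk", "vuka", "vuče"]))   # 'vu'
# ... and for "slati" / "šaljem" there is none at all.
print(repr(common_beginning(["slati", "šaljem"])))  # ''
```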

Morphological normalization
• the Croatian language (like most Slavic languages) is morphologically complex
  - elaborate inflectional and derivational morphology
• problematic for most NLP applications
  - requires the use of substantial linguistic knowledge
• our lexicon-based approach to normalization is somewhere in between lemmatization and stemming
  - suitable for other inflectionally complex languages

Croatian Morphology
1. high degree of affixation
   • word-forms are obtained by suffixation, prefixation, phonological alternations, stem extension
   • inflection
     - nouns: declension (7 cases, 2 numbers)
     - verbs: conjugation (tenses, persons, numbers, genders)
     - adjectives: declension (7 cases, 2 numbers, 3 genders), comparison (3 degrees), and definiteness
   • derivation
     - a large number of rules for deriving nouns from verbs, verbs from nouns, possessive adjectives, ...

Croatian Morphology 2
• inflection examples
  - adjective: brz, brza, brzi, brzima, brzih, brzoj, brze, brzim, brzog, brzoga, brz, brza, brzo, brzom, brzomu, brži, bržeg, brža, brži, bržima, bržih, bržoj, brže, bržim, bržem, bržima, najbrži, bržeg, najbrža, najbržima, najbržih, najbrže, najbržim, najbrži, najbržoj, ...
  - noun: brzina, brzinom, brzine, brzinama, brzinu, brzina, brzini
  - adjective: brzinski, brzinskom, brzinske, brzinskih, brzinska, brzinskoj, brzinsko, brzinskog, brzinskoga, ...
  - adverb: brzo, brže, najbrže, brzinski
• derivation examples
  - brz > brzina > brzinski > ...

Croatian Morphology 3
2. high degree of homography
   • vode = voda (water) | voditi (to lead) | vod (a platoon)
   • requires disambiguation (POS/MSD tagging)
3. affix ambiguity
   • many ambiguous suffixation rules
     - e.g. bolnic-a / bolnic-i vs. ruk-a / ruc-i
     - e.g. bolnic-a / bolnic-om vs. brodolom / brodolom-a
   • possible mismatches at the inflectional level
     - narančast / narančast-om vs. ruž / ruž-om (not ruža)
   • possible mismatches at the derivational level
     - e.g. kralj / kralj-ica vs. stan / stan-ica

Lexicon-based normalization
• lexicon-based morphological normalisation
  - a morphological lexicon associates each WF with its morphological norm (lemma, stem, ...) and, optionally, an MSD
  - incorporates linguistic knowledge and thus avoids the aforementioned pitfalls
• drawbacks
  - made by linguists, expensive and time-consuming
  - problems with coverage (neologisms, jargon, ...)
• our approach
  - rule-based acquisition of large-coverage morphological lexica from raw (unannotated) corpora

Our approach
1. acquisition of an inflectional lexicon
   • input: raw corpora and sets of inflectional and derivational rules in a convenient (grammar-book-like) formalism
2. normalisation of word-forms
   • inflectional (lemmatization)
   • inflectional + derivational
     - comparable to stemming (but more precise)
• advantages
  - can be used as both a lemmatizer (with MSD) and a stemmer (with a variable degree of conflation)
  - provides good lexicon coverage
  - requires only limited linguistic expertise

Morphology representation
• e.g. noun inflectional paradigm
  - vojnik (soldier)

    Case  Singular   Plural
    N     vojnik-Ø   vojnic-i
    G     vojnik-a   vojnik-a
    D     vojnik-u   vojnic-ima
    A     vojnik-a   vojnik-e
    V     vojnič-e   vojnic-i
    L     vojnik-u   vojnic-ima
    I     vojnik-om  vojnic-ima

Morphology representation 2
• defines inflectional and derivational rules
• uses functions as building blocks:
  - A) condition functions
  - B) string transformation functions
• each defined using a higher-order function
• e.g.

    sfx
    sfx('a')
    sfx('a')('vojnik') = 'vojnika'
    sfx('e') ∘ alt(pal)
    (sfx('e') ∘ alt(pal))('vojnik') = 'vojniče'
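The building blocks above translate almost literally into any language with first-class functions. The following is a minimal Python sketch, not the authors' Haskell code: `compose` stands in for the composition operator, and the `pal` table (first palatalization k > č, g > ž, h > š) is an assumed, partial definition, since the slides name `pal` without spelling it out.

```python
from functools import reduce


def sfx(suffix):
    """Higher-order function: returns a transformation that appends suffix."""
    return lambda word: word + suffix


def alt(alternation):
    """Returns a transformation applying a stem-final sound alternation."""
    def transform(word):
        for old, new in alternation:
            if word.endswith(old):
                return word[: -len(old)] + new
        return word
    return transform


def compose(*functions):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return lambda word: reduce(lambda acc, f: f(acc), reversed(functions), word)


# Assumed (partial) table for palatalization.
pal = [("k", "č"), ("g", "ž"), ("h", "š")]

print(sfx("a")("vojnik"))                     # vojnika
print(compose(sfx("e"), alt(pal))("vojnik"))  # vojniče
```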

Morphology representation 3

    Case  Singular   Plural
    N     vojnik-Ø   vojnic-i
    G     vojnik-a   vojnik-a
    D     vojnik-u   vojnic-ima
    A     vojnik-a   vojnik-e
    V     vojnič-e   vojnic-i
    L     vojnik-u   vojnic-ima
    I     vojnik-om  vojnic-ima

The paradigm is written as a pair of a condition on the lemma and a set of transformations (the symbols joining the two condition terms were lost in the source):

    (λs. ends('k','g','h')(s) [symbols lost] consGroup(s),
     {null, sfx('a'), sfx('u'), sfx('om'),
      sfx('e') ∘ alt(pal), sfx('i') ∘ alt(sib), sfx('ima') ∘ alt(sib), sfx('e')})
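A rule of this shape can be applied to a lemma to generate its word-forms. The Python sketch below is an illustration under assumptions: the `pal` and `sib` tables are assumed partial definitions (palatalization k > č, g > ž, h > š; sibilarization k > c, g > z, h > s), the `consGroup` part of the condition is left out because its exact role is not recoverable from the slide, and `word_forms` is a hypothetical helper.

```python
def sfx(suffix):
    return lambda word: word + suffix


def alt(alternation):
    def transform(word):
        for old, new in alternation:
            if word.endswith(old):
                return word[: -len(old)] + new
        return word
    return transform


def compose(f, g):
    """compose(f, g)(x) == f(g(x))"""
    return lambda word: f(g(word))


def ends(*suffixes):
    """Condition function: does the lemma end in one of the suffixes?"""
    return lambda word: word.endswith(suffixes)


def null(word):
    return word


pal = [("k", "č"), ("g", "ž"), ("h", "š")]  # assumed partial table
sib = [("k", "c"), ("g", "z"), ("h", "s")]  # assumed partial table

# (condition, transformations); the consGroup term is omitted in this sketch.
VOJNIK_RULE = (
    ends("k", "g", "h"),
    [null, sfx("a"), sfx("u"), sfx("om"),
     compose(sfx("e"), alt(pal)), compose(sfx("i"), alt(sib)),
     compose(sfx("ima"), alt(sib)), sfx("e")],
)


def word_forms(lemma, rule):
    """Generate the set of word-forms if the lemma satisfies the condition."""
    condition, transformations = rule
    if not condition(lemma):
        return set()
    return {transform(lemma) for transform in transformations}


print(sorted(word_forms("vojnik", VOJNIK_RULE)))
```

Applied to 'vojnik', the rule yields the eight distinct forms of the table above; a lemma that fails the condition, such as 'student', yields no forms.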

Morphology representation 4
• suitable also for more complex paradigms

    (c, {null, sfx('a'), sfx('u'), ..., sfx('ima')}
      ∪ {sfx('og'), sfx('om'), ..., sfx('ima')}
      ∪ {sfx('i') ∘ alt(jot), sfx('eg') ∘ alt(jot), ..., sfx('ima') ∘ alt(jot)}
      ∪ {sfx('i') ∘ alt(jot) ∘ pfx('naj'), ..., sfx('ima') ∘ alt(jot) ∘ pfx('naj')})
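For an adjective such as 'brz', the comparative and superlative sets add a stem alternation and a prefix on top of the plain suffixes. The Python sketch below covers only the transformations actually listed on the slide (the elided ones stay elided); the `jot` table contains just the single pair z > ž needed for 'brz' and is an assumption, as are the helper names.

```python
from functools import reduce


def sfx(suffix):
    return lambda word: word + suffix


def pfx(prefix):
    return lambda word: prefix + word


def alt(alternation):
    def transform(word):
        for old, new in alternation:
            if word.endswith(old):
                return word[: -len(old)] + new
        return word
    return transform


def compose(*functions):
    """Right-to-left composition: compose(f, g, h)(x) == f(g(h(x)))."""
    return lambda word: reduce(lambda acc, f: f(acc), reversed(functions), word)


def null(word):
    return word


jot = [("z", "ž")]  # assumed, deliberately partial jotation table

positive_indefinite = [null, sfx("a"), sfx("u"), sfx("ima")]
positive_definite = [sfx("og"), sfx("om"), sfx("ima")]
comparative = [compose(sfx(s), alt(jot)) for s in ("i", "eg", "ima")]
superlative = [compose(sfx(s), alt(jot), pfx("naj")) for s in ("i", "ima")]

# The union of the four sets, applied to one lemma.
paradigm = positive_indefinite + positive_definite + comparative + superlative
forms = {transform("brz") for transform in paradigm}
print(sorted(forms))
```

The generated set contains forms such as 'bržeg', 'najbrži' and 'najbržima', which also appear among the inflection examples earlier in the talk.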

Morphology representation 5
• advantages
  - resembles the morphology descriptions found in traditional grammar books
  - requires a minimum amount of linguistic knowledge
  - highly expressive: arbitrary higher-order functions (HOFs) can be defined
  - can be applied to other morphologically similar languages
• implemented in Haskell
  - purely functional programming language
  - requires minimum programming skills

Lexicon acquisition
• uses inflectional rules + raw corpora to extract lemmas and their paradigms
• uses frequency counts of WFs attested in the corpus
• much of the ambiguity is resolved by language-dependent heuristics
  - plausibility, priority
• linguistic quality is not vital
  - word-form conflation rather than generation
• human intervention is not required
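One plausible reading of this procedure is sketched below in Python. It is not the authors' algorithm: the two suffix-only paradigms in `PARADIGMS` are hypothetical and ignore sound alternations, and the scoring (count how many generated forms are attested in the corpus) is an assumed stand-in for the plausibility and priority heuristics, which the slides do not specify.

```python
from collections import Counter

# Hypothetical, heavily simplified paradigms: lists of endings, with the
# lemma ending first. Real rules would also include sound alternations.
PARADIGMS = {
    "noun-masc": ["", "a", "u", "om", "e", "i", "ima"],
    "noun-fem-a": ["a", "e", "i", "u", "o", "om", "ama"],
}


def candidates(word_form):
    """Yield (paradigm, stem) pairs that could have produced word_form."""
    seen = set()
    for name, endings in PARADIGMS.items():
        for ending in endings:
            if word_form.endswith(ending) and len(word_form) > len(ending):
                stem = word_form[: len(word_form) - len(ending)]
                if (name, stem) not in seen:
                    seen.add((name, stem))
                    yield name, stem


def support(paradigm, stem, counts):
    """Number of distinct forms of the paradigm attested in the corpus."""
    return sum(1 for ending in set(PARADIGMS[paradigm])
               if counts[stem + ending] > 0)


def best_analysis(word_form, counts):
    """Pick the (lemma, paradigm) hypothesis with the most attested forms."""
    scored = [(support(p, s, counts), p, s) for p, s in candidates(word_form)]
    score, paradigm, stem = max(scored)
    return stem + PARADIGMS[paradigm][0], paradigm, score


corpus = "ruža ruže ruži ružu ružom ružama brodolom brodoloma brodolomu"
counts = Counter(corpus.split())

print(best_analysis("ružom", counts))      # ('ruža', 'noun-fem-a', 6)
print(best_analysis("brodoloma", counts))  # ('brodolom', 'noun-masc', 3)
```

On this toy corpus the attested forms decide between the competing suffix rules: 'ružom' is assigned to the lemma 'ruža' rather than '*ruž', and 'brodoloma' to 'brodolom'.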

Results
• example lexicon
  - acquired from a 20 Mw newspaper corpus
  - based on 90 inflectional and >300 derivational rules
  - contains ca. 42,000 lemmas associated with over 500,000 WFs
• performance
  - linguistic quality F1 = 88% per type
  - coverage 96% per type and 98% per token
  - understemming = 7%
  - overstemming < 4%
• can be improved further by manual editing

Derivational normalization
• the inflectional lexicon is partitioned into equivalence classes based on derivational rules
• the degree of normalisation depends on the number of derivational rules used
• problem with semantics
  - context, degrees
  - derivation is not as semantically regular as inflection
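Partitioning a lexicon by derivational rules can be sketched with a union-find structure. The Python below is an illustration only: the three rules (`derive_ina`, `derive_ski`, `derive_ica`) are hypothetical simplifications modelled on the examples brz > brzina > brzinski and kralj / kralj-ica from earlier slides, and `derivational_classes` is an assumed helper name.

```python
def derive_ina(lemma):
    """Hypothetical rule: adjective -> noun, e.g. brz > brzina."""
    return lemma + "ina"


def derive_ski(lemma):
    """Hypothetical rule: noun in -a -> adjective, e.g. brzina > brzinski."""
    return lemma[:-1] + "ski" if lemma.endswith("a") else None


def derive_ica(lemma):
    """Hypothetical rule: noun -> noun in -ica, e.g. kralj > kraljica."""
    return lemma + "ica"


def derivational_classes(lemmas, rules):
    """Merge lemmas linked by a rule into equivalence classes (union-find)."""
    parent = {lemma: lemma for lemma in lemmas}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for lemma in lemmas:
        for rule in rules:
            derived = rule(lemma)
            if derived in parent:  # only link lemmas present in the lexicon
                parent[find(derived)] = find(lemma)

    classes = {}
    for lemma in lemmas:
        classes.setdefault(find(lemma), set()).add(lemma)
    return sorted(sorted(cls) for cls in classes.values())


LEMMAS = ["brz", "brzina", "brzinski", "kralj", "kraljica", "stan", "stanica"]

print(derivational_classes(LEMMAS, [derive_ina]))
print(derivational_classes(LEMMAS, [derive_ina, derive_ski, derive_ica]))
```

With one rule only 'brz' and 'brzina' are merged; with all three the lexicon collapses into three classes. The last class joins 'stan' and 'stanica', the derivational mismatch pointed out earlier, which shows why more rules also mean more semantically doubtful conflation.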

References and applications
• Reference
  - Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing and Management, 2008. (in press)
• Applied in document indexing
  - projects AIDE & CADIAL, www.cadial.org
  - Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / J. Van Nieuwenhove & P. Popelier (eds). Brugge : Die Keure, 2008. pp. 107-117.
• Applied in text classification
  - Malenica, Mislav; Šmuc, Tomislav; Šnajder, Jan; Dalbelo Bašić, Bojana. Language Morphology Offset: Text Classification on a Croatian-English Parallel Corpus // Information Processing and Management, 44 (2008), 1; 325-339.

Thank you for your attention!
