Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos.


Using a Lemmatizer to Support the Development and Validation of the Greek WordNet Harry Kornilakis 1, Maria Grigoriadou 1, Eleni Galiotou 1,2, Evangelos Papakitsos 1 1 Department of Informatics and Telecommunications, University of Athens, Greece 2 Department of Informatics, Technological Educational Institute of Athens, Athens, Greece {harryk, gregor,

Greek Wordnet Development
• The Greek Wordnet is part of the BalkaNet project, a multilingual lexical database with semantic relations for each of the following languages: Bulgarian, Czech, Greek, Romanian, Serbian and Turkish.
• The deployment of computational tools and resources has proven to be of major importance for the development of the monolingual Greek Wordnet (Galiotou et al.).

Use of a Lemmatizer in Greek Wordnet Development
• A lemmatizer for the Greek language has been used as the basis of a number of tools supporting the extraction and processing of linguistic information from dictionaries and corpora.
• Most existing lemmatizers for Greek are either tools that support specific applications or parts of systems for full morphological processing that require a large number of lexical resources.
• In our case it wasn't possible to use resources such as morphological dictionaries or annotated corpora.
Our design goals
• A lemmatizer useful for a number of different tools
• Requires as few lexical resources as possible
• Computationally efficient

Modern Greek Language Overview (1/2)
• The lemmatizer must take into account the peculiarities of the Greek language.
• Greek is a highly inflected language.
• Nouns decline for number and case.
• Adjectives decline for number, case, gender and degree. Each adjective has about 70 distinct forms.
• Verbs conjugate for voice, mood, tense, aspect, number and person. Each verb has about 60 distinct forms.

Modern Greek Language Overview (2/2)
Word Stress
• Each word of two or more syllables has a stressed syllable that is pronounced the loudest; in writing it is denoted by a stress mark (') over the nuclear vowel of the syllable.
• Word stress in Greek is distinctive, e.g. νόμος ('nomos - law) is different from νομός (no'mos - administrative region).
• Word stress is movable, i.e. the stress may change its position within the inflectional paradigm of the same word. E.g. the word θάλασσα ('θalasa - sea) becomes θαλασσών (θala'son - of the seas) in the genitive plural.

Lemmatizer for the Greek Language
Given a word in Greek as input, the lemmatizer analyzes the word and finds its dictionary citation form.
Lexical Information Required by the Lemmatizer
• A list of the citation forms of words. Our list was compiled from an electronic dictionary and automatically extended with some productive derivations (e.g. diminutives). It contains around words.
• A list containing information about how words are inflected in Greek. Each entry contains information about possible inflectional endings and about stress movement.
• A list of irregular forms of words. So far this list has about 400 such words.

A Lemmatizer for the Greek Language
(Short) description of the algorithm of the lemmatizer
• First we try to find the input word in the list of citation forms.
• Then we try to find the input word in the list of irregular forms.
• Then we try to match the ending of the word with the inflectional endings in the list of inflectional information. If an ending is found, it is removed so as to find the stem of the word. The stem is then used to form a number of possible citation forms of the input word. Finally, we search for these candidates in the list of citation forms, and each one that is found is considered a possible citation form of the input word.
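
The three lookup steps described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the tiny word lists, the rule table `INFLECTION_RULES` and the helper `strip_stress` are all hypothetical, and stress movement is approximated by ignoring stress marks while matching, so the sketch may return more candidates than the original tool would.

```python
import unicodedata

def strip_stress(word):
    """Remove the Greek stress mark (combining acute accent) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace("\u0301", ""))

# Toy resources (hypothetical); a real system would load much larger lists.
CITATION_FORMS = ["θάλασσα", "νόμος", "νομός", "βλέπω"]
IRREGULAR_FORMS = {"είδα": "βλέπω"}
# Each rule: (unstressed inflectional ending, ending of the citation form).
INFLECTION_RULES = [("ων", "α"), ("ες", "α"), ("ας", "α"),
                    ("ου", "ος"), ("ων", "ος"), ("οι", "ος")]

# Index citation forms by their unstressed spelling (a simplification
# standing in for the stress-movement information mentioned in the slides).
UNSTRESSED_INDEX = {}
for form in CITATION_FORMS:
    UNSTRESSED_INDEX.setdefault(strip_stress(form), []).append(form)

def lemmatize(word):
    """Return the possible citation forms of a Greek word (may be empty)."""
    # Step 1: the word is itself a citation form.
    if word in CITATION_FORMS:
        return [word]
    # Step 2: the word is a listed irregular form.
    if word in IRREGULAR_FORMS:
        return [IRREGULAR_FORMS[word]]
    # Step 3: strip a known ending, rebuild candidates, keep the known ones.
    bare = strip_stress(word)
    candidates = []
    for ending, citation_ending in INFLECTION_RULES:
        if bare.endswith(ending) and len(bare) > len(ending):
            stem = bare[: -len(ending)]
            for lemma in UNSTRESSED_INDEX.get(stem + citation_ending, []):
                if lemma not in candidates:
                    candidates.append(lemma)
    return candidates
```

With these toy lists, `lemmatize("θαλασσών")` yields the single candidate θάλασσα, while `lemmatize("νόμου")` yields both νόμος and νομός because the sketch does not track where the stress may legitimately fall.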

Tools for Wordnet Development and Validation
• Lemmatized Word-frequency Counter
• Translator of Words from Greek to English
• Part of Speech Tagger

Lemmatized Word-frequency Counter
• This tool counts the occurrences of words in corpora, regardless of the inflected form in which they appear.
• In Wordnet development, when determining base concepts it is useful to know the frequency of words in corpora, so as to avoid choosing as base concepts words that are frequent in English but infrequent in Greek.
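
A minimal sketch of such a counter, assuming a lemmatizer is available as a function that maps a word to a list of candidate citation forms. The function name `lemma_frequencies`, the toy table `TOY_LEMMAS` and the choice to count an ambiguous word under its first candidate are assumptions made for illustration only.

```python
import re
from collections import Counter

def lemma_frequencies(text, lemmatize):
    """Count word occurrences per citation form rather than per surface form."""
    counts = Counter()
    for token in re.findall(r"\w+", text.lower()):
        lemmas = lemmatize(token)
        # Assumption: take the first candidate; unknown words are counted
        # under their surface form.
        counts[lemmas[0] if lemmas else token] += 1
    return counts

# Hypothetical stand-in for the real lemmatizer.
TOY_LEMMAS = {
    "θάλασσα": ["θάλασσα"],
    "θάλασσες": ["θάλασσα"],
    "θαλασσών": ["θάλασσα"],
    "νόμοι": ["νόμος"],
}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, [])

counts = lemma_frequencies("Θάλασσες, θάλασσα και θαλασσών νόμοι", toy_lemmatize)
```

Here the three inflected forms of θάλασσα are merged into one entry, which is the behaviour the slide describes.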

Translator of Words from Greek to English
• Given a Greek word, this tool finds the English translation of that word based on a bilingual Greek-English dictionary.
• Unlike English, Greek is a highly inflected language, so different forms of a word in Greek correspond to the same English word.
• The tool first calls the lemmatizer to find the citation form of the word and then looks it up in the bilingual Greek-English dictionary to find its English translation.
• In the framework of Wordnet development it is used to map words appearing in Greek corpora to their Inter-Lingual-Index (ILI) numbers or to directly find their equivalents in Princeton WordNet.
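
The lemmatize-then-look-up procedure might be sketched as follows. The dictionary contents, the function name `translate` and the stand-in lemmatizer are hypothetical; the mapping to ILI numbers is not shown because the slides do not describe its format.

```python
def translate(word, lemmatize, bilingual_dict):
    """Return English translations of a Greek word via its citation forms."""
    translations = []
    for lemma in lemmatize(word):
        for english in bilingual_dict.get(lemma, []):
            if english not in translations:
                translations.append(english)
    return translations

# Hypothetical toy resources standing in for the real lemmatizer and dictionary.
TOY_LEMMAS = {"θαλασσών": ["θάλασσα"], "νόμου": ["νόμος", "νομός"]}
TOY_DICTIONARY = {
    "θάλασσα": ["sea"],
    "νόμος": ["law"],
    "νομός": ["administrative region"],
}

def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, [])

sea = translate("θαλασσών", toy_lemmatize, TOY_DICTIONARY)
```

When the lemmatizer returns several candidate citation forms, this sketch simply returns the translations of all of them and leaves the choice to the user.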

Part of Speech Tagger
• By adding information about the part of speech of words, we extended the lemmatizer into a part of speech tagger for Greek texts.
• Enhanced with local disambiguation, such a POS tagger can handle most tagging problems in the Greek language.
• The part of speech tagger was used for the annotation of a Greek language corpus. The text of George Orwell's 1984, which contains around words, was used.
• This will be used for producing comparative coverage statistics for the wordnets in BalkaNet. The text has already been aligned and annotated for the rest of the languages of BalkaNet (except Turkish) as part of the Multext-East project (Erjavec et al.).
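
One way to picture the extension is a lookup that returns (citation form, part of speech) pairs, followed by a rule that inspects the previous tag. The slides do not say which disambiguation rules were used, so the rule table `PREFERRED_AFTER`, the tag names and the toy analyses below are purely hypothetical.

```python
# Hypothetical analyses: word form -> list of (citation form, part of speech).
ANALYSES = {
    "μια": [("ένας", "ARTICLE")],
    "αυτός": [("αυτός", "PRONOUN")],
    # Ambiguous: the noun φορά ("time, occasion") or a form of the verb φορώ ("wear").
    "φορά": [("φορά", "NOUN"), ("φορώ", "VERB")],
}

# Hypothetical local rule: which tag to prefer given the previous word's tag.
PREFERRED_AFTER = {"ARTICLE": "NOUN", "PRONOUN": "VERB"}

def tag(tokens, analyses):
    """Return (token, citation form, part of speech) triples for the tokens."""
    tagged = []
    previous_pos = None
    for token in tokens:
        candidates = analyses.get(token, [])
        if not candidates:
            choice = (token, "UNKNOWN")
        elif len(candidates) == 1:
            choice = candidates[0]
        else:
            # Local disambiguation: prefer the tag suggested by the left context,
            # falling back to the first candidate.
            preferred = PREFERRED_AFTER.get(previous_pos)
            matching = [c for c in candidates if c[1] == preferred]
            choice = matching[0] if matching else candidates[0]
        tagged.append((token, choice[0], choice[1]))
        previous_pos = choice[1]
    return tagged
```

Under these toy rules, φορά is tagged as a noun in `tag(["μια", "φορά"], ANALYSES)` and as a verb in `tag(["αυτός", "φορά"], ANALYSES)`.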

Conclusions
• A lemmatizer is very useful for processing a highly inflected language such as Greek.
• We can create a cost-effective lemmatizer without the need for complicated and hard-to-find (or hard-to-build) resources.
• Such a lemmatizer can be used as part of a number of other computational tools for Wordnet development and validation.
• We have presented three such tools and their application in the framework of the BalkaNet project.