Principles of organizing a common morphological tagset and a search engine for PolUKR (Polish-Ukrainian Parallel Corpus) Польсько-Український паралельний.

Presentation transcript:

Principles of organizing a common morphological tagset and a search engine for PolUKR (Polish-Ukrainian Parallel Corpus)
Польсько-Український паралельний корпус / Polsko-Ukraiński Korpus Równoległy
Natalia Kotsyba, Institute of Slavic Studies, Polish Academy of Sciences
Olga Shypnivska, ULIF, Ukrainian Academy of Sciences
Magdalena Turska, Warsaw University

Main objectives and expected applications
- at least 3 million tokens; representative
- sentence-level alignment
- morphological annotation with a common tagset
- public access; user-friendly
- linguistic material for
  – (independent) language learning
  – bilingual dictionaries
  – research on grammar and lexis
- translation memory for humans and machines

Statistics (prototype version)
Columns: total, Polish part, Ukrainian part
Rows: texts (7035), tokens, characters, Kb

Search (present)
- based on Perl regular expressions
- any search string has to be enclosed in "/", e.g. /Холодна війна/ (Cold War)
- special characters:
  | alternation
  ( ) beginning and end of a subchain (group)
  [ ] beginning and end of a defined character class
  ? 0 or 1 occurrence
  * 0 or more occurrences
  + 1 or more occurrences
  \s any whitespace character
  \w any letter, digit or underscore
  \b word boundary
  \ escape

Examples of search formulae
/jako/ → „jako”
/jako\s/ → „jako, niejako, dwojako”
/\bjako/ → „jakość”
/norma\./ → „norma” before a full stop
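The four formulae can be tried outside the corpus interface. The sketch below is only an approximation: it uses Python's `re` module rather than the Perl engine the slides mention (the constructs used here behave the same way in both), and the sample text and the helper `words_hit` are invented for illustration.

```python
import re

# Invented sample text; the words come from the slide, the sequence does not.
TEXT = "jako niejako dwojako jakość norma. normalny"

def words_hit(pattern, text=TEXT):
    """Return the space-delimited word in which each match of `pattern` starts."""
    hits = []
    for m in re.finditer(pattern, text):
        start = text.rfind(" ", 0, m.start()) + 1   # beginning of the enclosing word
        end = text.find(" ", m.start())             # first space after the match start
        hits.append(text[start:end if end != -1 else len(text)])
    return hits

for pattern in (r"jako", r"jako\s", r"\bjako", r"norma\."):
    print(f"/{pattern}/ ->", words_hit(pattern))
```

Without `\s` or `\b` the bare string matches anywhere inside a word; `\s` restricts hits to words ending in "jako", and `\b` to words beginning with it.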

Sources of morphological information
Polish: IPI PAN corpus + …
Ukrainian:
- grammatical dictionary by ULIF, UAS (Igor Shevchenko): lemma <> wordform
- morphological analyzer (its information is slightly different; built for homonymy disambiguation)
- no lemmatization (so far)

Types of tagsets
SYMBOLS: all possible grammatical characteristics of a wordform are encoded in one symbol
- English (BNC), Ukrainian
- takes little machine memory but demands too much human memory
CHAINS: contain codes corresponding to particular grammatical categories and/or their values; the morphological characteristics of a wordform are represented by a sequence of such codes
- can be even more economical than symbols if a query concerns morphological categories shared by several lexico-grammatical classes
- positional (Czech): every category (and its values) has a fixed position in the chain
- flexemic (Polish, Russian): every category has its own subtagset
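A toy comparison may make the economy argument concrete. Every tag below is invented for illustration and belongs to no real tagset; the three-position layout and the colon-separated layout are likewise assumptions. The point is only that a cross-class query such as "anything in the dative" needs a lookup table for symbol tags but a single check for chain tags.

```python
# Symbol-style tags: one opaque code per feature bundle (all codes hypothetical).
SYMBOL_MEANING = {
    "A1": ("noun", "sg", "dat"),
    "A2": ("noun", "pl", "dat"),
    "B1": ("adj", "sg", "dat"),
    "B2": ("adj", "sg", "nom"),
}

def symbols_with(value):
    """List every symbol whose feature bundle contains `value`."""
    return sorted(s for s, feats in SYMBOL_MEANING.items() if value in feats)

def positional_has_case(tag, case_code):
    """Positional chain: position 0 = POS, 1 = number, 2 = case (assumed layout)."""
    return tag[2] == case_code

def chain_has(tag, value):
    """Chain of separate category values, here colon-separated (assumed layout)."""
    return value in tag.split(":")

print(symbols_with("dat"))              # symbols a dative query must enumerate
print(positional_has_case("NSD", "D"))  # one position checked, whatever the POS
print(chain_has("adj:sg:dat", "dat"))   # one value checked, whatever the POS
```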

Multext-East tagset
- for En Ro Sl Cz Bg Et Hu Hr Sr Re
- chain-like; criticised
- 14 PoS: N10, V15, A12, P(ron)17, Det10, T(he)6, adveRb6, S(adposition)4, C(onj)7, nuMeral12, Intjn2, X(residual), Yabbr5, Qparticle3
- only Bg and Hu do not have modal verbs and copulas
- En Ro have determiners, Ro Hu Re have articles, Bg has neither (analyticity, segmentation); is a Bg noun formally indefinite if the article is attached to the adjective? (cf. agglutinativity of Pl być)
- negation as a morphological category
- Cz: transgressive (adverbial participle)

Treatment of participles
Polish (no aspectual characteristics)
(Here and further cited from: Adam Przepiórkowski and Marcin Woliński, A Flexemic Tagset for Polish.)
Ukrainian (aspect and tense)
- verb, adverbial participle, perfective aspect, past tense, active voice: VW прочитавши (having read)
- verb, adverbial participle, imperfective aspect, present tense, active voice: UQ читаючи (reading)
(Here and further cited from: Широков В.А et al. Корпусна лінгвістика.)
PolUKR: participle I (doing/having done) characterised by aspect

Treatment of pronouns
- the notorious Slavonic pronoun problem: 296 unique tags for 309 pronouns
- Polish: division into 1st-2nd person, 3rd person and siebie (ów, jak?)
- Ukrainian: pro-noun, pro-adjective
- Russian: also pro-predicative and pro-adverb
- Czech: many subcategories on the level of SubPoS
- PolUKR: Ua approach and Pl division into 1st-2nd and 3rd person

Treatment of predicatives
- Polish: adverbs with modal semantics like można '(it is) allowed/one can', trzeba '(it is) necessary', ?to
- Ukrainian (code X0): includes adverbs of state like жарко '(it is) hot', шкода, жаль '(it is) a pity'
- PolUKR: moving the category from the morphological level to the semantic one

Search engine for PolUKR
- choose the direction of the search (Ua>Pl or Pl>Ua)
- search conditions for both languages (RvonW)
- 3 levels of search:
  - exact form
  - (lemma) with the morphological choice
  - using Poliqarp-like tag formulas (for advanced users)
- idea of subcategories (either a POS or a SUBPOS can be selected, but not both; similarly, one cannot select all subcategories of a POS), cf. aliases in the IPI PAN corpus
- alternatives are provided through tick boxes, so that one can choose EITHER „VERB finite past” OR „NOUN dative neuter” OR something else, etc.
- restrictions on choice within 1 of 10 POS
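The EITHER/OR behaviour of the tick boxes can be pictured as a disjunction of feature bundles. The sketch below is an assumption about how such a filter might work, not the PolUKR implementation: the function `matches`, the attribute names and the three sample tokens are all hypothetical.

```python
def matches(token, alternatives):
    """True if the token satisfies at least one ticked alternative.

    Each alternative is a dict of attribute -> required value; all of its
    attributes must hold (AND), while the alternatives are combined with OR.
    """
    return any(
        all(token.get(attr) == value for attr, value in alt.items())
        for alt in alternatives
    )

# Two ticked boxes: "VERB finite past" OR "NOUN dative neuter" (hypothetical encoding).
ticked = [
    {"pos": "verb", "form": "finite", "tense": "past"},
    {"pos": "noun", "case": "dative", "gender": "neuter"},
]

# Hand-annotated sample tokens, invented for illustration.
tokens = [
    {"orth": "czytał", "pos": "verb", "form": "finite", "tense": "past"},
    {"orth": "dziecku", "pos": "noun", "case": "dative", "gender": "neuter"},
    {"orth": "dom", "pos": "noun", "case": "nominative", "gender": "masculine"},
]

found = [t["orth"] for t in tokens if matches(t, ticked)]
print(found)
```

Keeping each alternative as a self-contained bundle mirrors the restriction that a single choice stays within one POS.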

Built-in restrictions on search

Literature
- INTERA unified tagset project
- Tomaž Erjavec et al. Multext-East specifications for Slavic languages, Budapest,
- Jan Hajič. Positional Tags: Quick Reference (Czech „HM” Morphology),
- Adam Przepiórkowski and Marcin Woliński. A Flexemic Tagset for Polish. In: The Proceedings of the Workshop on Morphological Processing of Slavic Languages, EACL
- Elena Paskaleva. Balkan South-East Corpora Aligned to English. In: The Proceedings of the Workshop on Common Natural Language Processing Paradigm for Balkan Languages, EACL 2003
- Широков В.А et al. Корпусна лінгвістика. Київ: Довіра, 2005.