Tagset Reductions in Morphosyntactic Tagging of Croatian Texts Željko Agić, Marko Tadić and Zdravko Dovedan University of Zagreb {zagic, mtadic,


Introduction
- morphosyntactic tagging: assigning word categories and subcategories to words in sentence context
- issues: modelling sentence context, handling unknown words, dealing with sparse data
- common approaches: rule-based, stochastic, hybrid
- data-driven models are predominant today
- best-performing taggers are based on SVM, CRF, HMM

Introduction
- data-driven tagging modules: the tagger and the data
- data implies a tagset encoding word (sub)categories
- a solved problem?
- state-of-the-art accuracy on English is 97-98%
- tagsets for English: max. 100 different tags
- 1475 different morphosyntactic tags are used in the Croatian Morphological Lexicon
- accuracy of state-of-the-art taggers drops by ca. 10% on Croatian

Tagging Croatian texts
- CroTag tagger: inspired by TnT and HunPos
- trained on a 118 kw corpus manually annotated with the MTE v3 tagset
- accuracy identical to that of TnT and HunPos (96-97% EN, 85-86% HR)
- all three are highly dependent on unknown word counts
- improvements: using the inflectional lexicon to handle unknown words; tagger voting, hybridization?
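The slide does not say how the inflectional lexicon is wired into the tagger, so the following is only a minimal sketch of one possible arrangement: measure the unknown-word rate of a test sample, then fall back to the lexicon for candidate tags when a word was never seen in training. The function names, the lookup order and the toy lexicon entries (which are not exhaustive) are all assumptions made for illustration.

```python
# Hypothetical helpers; neither the names nor the lookup order come from CroTag.

def unknown_rate(test_tokens, train_vocab):
    """Share of test tokens that never occurred in the training data."""
    unknown = [w for w in test_tokens if w.lower() not in train_vocab]
    return len(unknown) / len(test_tokens)


def candidate_tags(word, train_lexicon, inflectional_lexicon):
    """Return (candidate tag set, source) for a word.

    Words seen in training keep their observed tags; unseen words are
    looked up in the inflectional lexicon; anything else is left to a
    guesser (represented here by an empty set).
    """
    key = word.lower()
    if key in train_lexicon:
        return train_lexicon[key], "training"
    if key in inflectional_lexicon:
        return inflectional_lexicon[key], "lexicon"
    return set(), "guesser"


# Toy data for illustration only.
train_lexicon = {"kuća": {"Ncfsn"}, "grad": {"Ncmsn"}}
inflectional_lexicon = {"kuće": {"Ncfsg", "Ncfpn", "Ncfpa"}}
test_tokens = ["kuća", "kuće", "grad", "gradovima"]

print(unknown_rate(test_tokens, train_lexicon))
for token in test_tokens:
    print(token, candidate_tags(token, train_lexicon, inflectional_lexicon))
```

With a lookup like this, a word missing from the training corpus but present in the lexicon is disambiguated among a handful of attested tags rather than among the whole tagset.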

From another perspective...
- goals of tagging: reaching perfect accuracy on full tagset, or making large-scale NLP systems perform better?
- specific requirements: users and systems always have them
- example: named entity normalization in Croatian. Is it Ivo (m.) or Iva (f.) Sanader?
- specific tasks may require specific tagset design
- keeping speed and memory footprint
- reducing tagset size means raising accuracy

Reducing the tagset
- MulText East version 3: positional tagset, letters encode categories
- example: Ncmsn = noun, common, masculine, etc.
- the subsets:
  1 – strip non-inflective categories and numerals (800 tags)
  2 – strip verbs (739)
  3 – strip all but gender, number, case and noun type (243)
  4 – remove case category (48)
  5 – keep noun type category only (15)
  6 – maintain part-of-speech information only (13)
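Because the tagset is positional, a reduction can be written as a simple string operation on each tag. The sketch below assumes the noun layout of MULTEXT-East tags (part of speech, type, gender, number, case, in that order). The function names are hypothetical, and the two reductions only illustrate the idea behind "remove case" and "part-of-speech only"; they are not the exact subset definitions used in the experiments.

```python
# Illustrative only: positions follow the MULTEXT-East noun layout;
# the function names are hypothetical.
NOUN_POSITIONS = ("pos", "type", "gender", "number", "case")


def decode_noun(tag):
    """Map each letter of a noun tag to the category at that position."""
    return dict(zip(NOUN_POSITIONS, tag))


def drop_case(tag):
    """Blank out the case position of a noun tag (in the spirit of subset 4)."""
    if tag.startswith("N") and len(tag) >= 5:
        return tag[:4] + "-" + tag[5:]
    return tag


def pos_only(tag):
    """Keep only the part-of-speech letter (in the spirit of subset 6)."""
    return tag[:1]


tags = ["Ncmsn", "Ncmsg", "Ncfsn", "Npmsn"]
print(decode_noun("Ncmsn"))
print(sorted({drop_case(t) for t in tags}))  # fewer distinct tags than before
print(sorted({pos_only(t) for t in tags}))   # a single tag for all nouns
```

Each reduction maps several full tags onto one reduced tag, which is why the tag inventory shrinks from subset to subset.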

Results

More results
- adjectives, nouns and pronouns: the categories most difficult to tag for Croatian
- a combination of frequency and number of tags used
- maybe these are the most important to tag accurately?
- F1-measures on adjectives, nouns and pronouns:

  type        subset 0   subset 4   subset 5
  Adjective   0.64±      ±          ±0.02
  Noun        0.79±      ±          ±0.01
  Pronoun     0.76±      ±          ±0.01
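For reference, the F1-measure is the harmonic mean of precision and recall. The slide does not spell out how per-category scores were computed, so the sketch below makes two assumptions: a token belongs to a category if its tag starts with that category's letter, and a token counts as a hit only when the full tag is correct. The function name and the toy tag sequences are hypothetical.

```python
def category_f1(gold, pred, category):
    """F1 for one word category; a hit requires the full tag to match."""
    hits = sum(1 for g, p in zip(gold, pred)
               if g == p and g.startswith(category))
    n_pred = sum(1 for p in pred if p.startswith(category))
    n_gold = sum(1 for g in gold if g.startswith(category))
    if hits == 0:
        return 0.0
    precision = hits / n_pred
    recall = hits / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy sequences for illustration only.
gold = ["Ncmsn", "Afpmsn", "Ncfsg", "Vmip3s"]
pred = ["Ncmsn", "Afpmsg", "Ncfsn", "Vmip3s"]

for category in ("N", "A", "V"):
    print(category, category_f1(gold, pred, category))
```

Running the same function on tags that have first been reduced shows how merging tags turns some former errors (for example a wrong case on an otherwise correct noun tag) into hits.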

Conclusions
- results are as expected: reducing tagset size raises tagging accuracy, sacrificing information for efficiency
- the reductions are illustrative: careful tagset design is required with regard to requirements
- further work, as mentioned: reaching perfect accuracy on full tagset, or making large-scale NLP systems perform better?

Your questions? Computational Linguistic Models and Language Technologies for Croatian rmjt.ffzg.hr | hml.ffzg.hr | hnk.ffzg.hr
