Near Language Identification Using NooJ Božo Bekavac, Kristina Kocijan, Marko Tadić Faculty of Humanities and Social Sciences University of Zagreb, Croatia.

Near Language Identification Using NooJ Božo Bekavac, Kristina Kocijan, Marko Tadić Faculty of Humanities and Social Sciences University of Zagreb, Croatia NooJ 2014 Sassari

Introduction
- It is not hard to automatically distinguish very different languages, but similar languages such as
  - Czech and Slovak,
  - Indonesian and Malay, or
  - Brazilian Portuguese and European Portuguese
  are very hard to distinguish, even for state-of-the-art statistical tools
  - such tools often confuse those languages
- We use NooJ as the core part of a system designed for automatic identification of near languages
  - Croatian and Serbian

Differences: Croatian - Serbian
- Lexical level (some differences)
  - Reflex of the Proto-Slavic vowel jat: ije/je vs. e
    - e.g. milk (en): mlijeko (hr) vs. mleko (sr)
  - Verbs ending in -irati vs. -ovati
    - e.g. to employ (en): angažirati (hr) vs. angažovati (sr)
- Construction of the future tense
  - analytic in hr, e.g. pitat ću (I will ask)
  - synthetic in sr, e.g. pitaću (I will ask)
- Structures typical of each language
  - Croatian: modal verb + infinitive, e.g. hoću raditi
  - Serbian: modal verb + da + present, e.g. hoću da radim
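The future-tense contrast above can be illustrated with a small Python sketch. This is not how the presented system works (it relies on NooJ grammars and Croatian dictionaries); the regular expressions, the restriction to the first-person clitic ću, and the function name `count_future_cues` are assumptions made only for illustration. Without dictionary lookup, patterns like these will produce false positives on real text.

```python
import re

# Toy surface patterns (assumptions, not the authors' NooJ grammars).
# Croatian analytic future: truncated infinitive + separate clitic, e.g. "pitat ću".
ANALYTIC_FUTURE = re.compile(r"\b\w{2,}t\s+ću\b", re.IGNORECASE)
# Serbian synthetic future: clitic fused to the stem, e.g. "pitaću".
# Requiring a stem vowel a/i keeps modal forms such as "hoću" from matching.
SYNTHETIC_FUTURE = re.compile(r"\b\w{3,}[ai]ću\b", re.IGNORECASE)


def count_future_cues(text: str) -> dict:
    """Count naive surface cues for the two future-tense constructions."""
    return {
        "analytic_hr": len(ANALYTIC_FUTURE.findall(text)),
        "synthetic_sr": len(SYNTHETIC_FUTURE.findall(text)),
    }


if __name__ == "__main__":
    print(count_future_cues("Sutra pitat ću ga."))
    print(count_future_cues("Sutra pitaću ga."))
```

The sketch only shows why the two constructions are easy to tell apart on the surface; a usable detector needs morphological knowledge of the kind NooJ provides.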

Formalizing differences
- We used only Croatian language resources
  - and designed morphological grammars for recognizing unknown tokens in Serbian
  - some words specific to Serbian are left unknown, e.g. bread (en): kruh (hr) vs. hleb (sr)
  - but this had no impact on the efficiency of the system
- Syntactic and lexical grammars focus on formalizing the differences between the languages
- Examples follow…

Lexical grammars (1)
- E.g. president (en): predsjednik (hr) vs. predsednik (sr)

Lexical grammars (2)
- E.g. to meet (en): sastati (hr) vs. sastaću (sr)

Syntactic grammars (sr)
- E.g. should do (en): treba da uradi (sr)

Syntactic grammars (hr)
- E.g. should do (en): treba uraditi (hr)

Implementation
- Instead of NoojApply we used:
  - a fully automated process driven by AutoHotkey
  - AutoHotkey: a scripting language for desktop automation (suggested by Max)
    - enables emulation of clicking in desktop applications
    - provides scripting-language capabilities
- Pros and cons are discussed in the conclusion

System description
- Open text
- Apply Croatian linguistic analyses
- Count
  - No. of tokens
  - No. of Serbian lexical units
  - No. of syntactic constructions V da V
  - No. of syntactic constructions V Vinf
- Make a decision based on the results of the above processing
  - based on percentages of occurrences
- Write statistics and results
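The counting and decision steps listed above can be sketched in Python. The slides do not give the actual decision rule, so the voting scheme, the `lexical_threshold` value, and the names `Counts` and `classify` are all hypothetical. The sketch only assumes that the four counts have already been obtained from the NooJ analysis.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Counts collected for one text (field names are illustrative)."""
    tokens: int
    serbian_lexical: int  # Serbian lexical units recognized by the grammars
    v_da_v: int           # syntactic constructions V da V (Serbian cue)
    v_vinf: int           # syntactic constructions V Vinf (Croatian cue)


def classify(c: Counts, lexical_threshold: float = 1.0):
    """Return ("sr" or "hr", percentages) using an assumed voting rule.

    lexical_threshold is a hypothetical cut-off, in percent of tokens.
    """
    if c.tokens <= 0:
        raise ValueError("text has no tokens")

    def pct(n: int) -> float:
        return 100.0 * n / c.tokens

    stats = {
        "sr_lexical_pct": pct(c.serbian_lexical),
        "v_da_v_pct": pct(c.v_da_v),
        "v_vinf_pct": pct(c.v_vinf),
    }
    # Assumed rule: one vote for Serbian if enough Serbian lexical units occur,
    # another if "V da V" is more frequent than "V Vinf".
    sr_votes = int(stats["sr_lexical_pct"] >= lexical_threshold)
    sr_votes += int(stats["v_da_v_pct"] > stats["v_vinf_pct"])
    # Croatian is the default because only Croatian resources are applied.
    label = "sr" if sr_votes >= 1 else "hr"
    return label, stats


if __name__ == "__main__":
    label, stats = classify(Counts(tokens=200, serbian_lexical=6, v_da_v=3, v_vinf=0))
    print(label, stats)
```

Defaulting to Croatian when no Serbian cue fires is a design choice of this sketch; it mirrors the fact that only Croatian resources are applied, so Serbian has to be detected through explicit evidence.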

Output of processing
- Demo

Results
- Testing was performed on 2500 articles from the SETimes corpus
  - texts in Serbian and Croatian
  - short news items translated from English
- The system obtained a precision of 99.82%
  - outperforming all known systems on this task
- 3 Serbian texts were misclassified as Croatian
  - texts with low recall on the considered criteria

Conclusion & future work
- NooJ and AutoHotkey in combination are sufficient even for very complex tasks
- The system is completely automated
- Disadvantage: AutoHotkey is highly dependent on screen resolution (automatic clicking)
- Future work:
  - There is room for improving the system
  - Take unknown words into account
  - Tune system voting
  - Create lists of "forbidden" words

Thank you for your attention!