
Near Language Identification Using NooJ Božo Bekavac, Kristina Kocijan, Marko Tadić Faculty of Humanities and Social Sciences University of Zagreb, Croatia.


1 Near Language Identification Using NooJ
Božo Bekavac, Kristina Kocijan, Marko Tadić
Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
NooJ 2014, Sassari, 2014-06-04

2 Introduction
• Automatically distinguishing very different languages is not hard, but closely related languages such as
  • Czech and Slovak
  • Indonesian and Malay
  • Brazilian Portuguese and European Portuguese
  are very hard to distinguish, even for state-of-the-art statistical tools
  • such tools often confuse these languages
• We use NooJ as a core part of a system designed for automatic identification of near languages
  • Croatian and Serbian

3 Differences: Croatian - Serbian
• Lexical level (some differences)
  • Reflex of the Proto-Slavic vowel jat: ije/je vs. e
    • e.g. milk (en): mlijeko (hr) vs. mleko (sr)
  • Verbs ending in -irati vs. -ovati
    • e.g. to employ (en): angažirati (hr) vs. angažovati (sr)
• Construction of the future tense
  • analytic in hr, e.g. pitat ću (I will ask)
  • synthetic in sr, e.g. pitaću (I will ask)
• Typical structures of each language
  • Croatian: modal verb + infinitive, e.g. hoću raditi
  • Serbian: modal verb + da + present, e.g. hoću da radim
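The regular correspondences listed on this slide can be illustrated with a small Python sketch. This is not the NooJ grammars used by the authors; the function names are hypothetical, and the string rules are deliberately naive (they over-generate on real vocabulary), so treat it only as an illustration of the three patterns.

```python
def future_first_person(infinitive: str) -> tuple[str, str]:
    """Return (Croatian analytic, Serbian synthetic) 1st-person future forms.

    Assumption: the verb is a regular infinitive ending in -ti.
    """
    if not infinitive.endswith("ti"):
        raise ValueError("expected an infinitive ending in -ti")
    croatian = infinitive[:-1] + " ću"  # pitati -> pitat ću
    serbian = infinitive[:-2] + "ću"    # pitati -> pitaću
    return croatian, serbian


def irati_to_ovati(verb: str) -> str:
    """Map a Croatian -irati verb to the corresponding -ovati form."""
    if verb.endswith("irati"):
        return verb[: -len("irati")] + "ovati"
    return verb


def ekavian_candidate(word: str) -> str:
    """Naively replace the first ije/je jat reflex with e.

    This over-generates: not every 'je' in a word is a jat reflex.
    """
    if "ije" in word:
        return word.replace("ije", "e", 1)
    if "je" in word:
        return word.replace("je", "e", 1)
    return word


print(future_first_person("pitati"))
print(irati_to_ovati("angažirati"))
print(ekavian_candidate("mlijeko"), ekavian_candidate("predsjednik"))
```

A production system would constrain these rules with a dictionary, which is the role the Croatian NooJ resources play in the approach described here.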

4 Formalizing differences
• We used only Croatian language resources
  • and designed morphological grammars for recognizing unknown tokens in Serbian
  • some words specific to Serbian are left unknown, e.g. bread (en): kruh (hr) vs. hleb (sr)
  • but this had no impact on the efficiency of the system
• Syntactic and lexical grammars focus on formalizing the differences between the languages
• Examples follow…

5 Lexical grammars (1)
• E.g. president (en): predsjednik (hr) vs. predsednik (sr)

6 Lexical grammars (2)
• E.g. to meet (en): sastati (hr) vs. sastaću (sr)

7 Syntactic grammars (sr)
• E.g. should do (en): treba da uradi (sr)

8 Syntactic grammars (hr)
• E.g. should do (en): treba uraditi (hr)
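The two syntactic patterns from slides 7 and 8 can be sketched as a count over part-of-speech tags. The tag names ("V", "DA", "VINF") and the pre-tagged input are assumptions made for illustration; in the system described here, the matching is done by NooJ syntactic grammars, not by this code.

```python
def count_constructions(tagged: list[tuple[str, str]]) -> tuple[int, int]:
    """Count 'V da V' and 'V Vinf' sequences in a list of (form, tag) pairs.

    Hypothetical tag set: "V" = finite verb, "DA" = the particle da,
    "VINF" = infinitive.
    """
    tags = [tag for _, tag in tagged]
    v_da_v = sum(
        1 for i in range(len(tags) - 2) if tags[i:i + 3] == ["V", "DA", "V"]
    )
    v_vinf = sum(
        1 for i in range(len(tags) - 1) if tags[i:i + 2] == ["V", "VINF"]
    )
    return v_da_v, v_vinf


serbian_like = [("treba", "V"), ("da", "DA"), ("uradi", "V")]
croatian_like = [("treba", "V"), ("uraditi", "VINF")]

print(count_constructions(serbian_like))
print(count_constructions(croatian_like))
```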

9 Implementation
• Instead of NoojApply we used:
  • A fully automated process driven by AutoHotkey
    • http://www.autohotkey.com/
    • AutoHotkey: a scripting language for desktop automation (suggested by Max)
    • emulates clicking in desktop applications
    • provides scripting-language capabilities
• Pros and cons are discussed in the conclusion

10 System description
• Open the text
• Apply Croatian linguistic analyses
• Count
  • No. of tokens
  • No. of Serbian lexical units
  • No. of syntactic constructions V da V
  • No. of syntactic constructions V Vinf
• Make a decision based on the results obtained above
  • based on percentages of occurrences
• Write statistics and results
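The counting and decision steps above might be sketched as follows. The slides do not state the actual decision rule or thresholds, so the voting scheme and the `lexical_threshold_pct` value here are purely hypothetical; only the four counts and the use of percentages come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tokens: int           # total number of tokens in the text
    serbian_lexical: int  # Serbian lexical units matched
    v_da_v: int           # "V da V" constructions
    v_vinf: int           # "V Vinf" constructions


def classify(counts: Counts, lexical_threshold_pct: float = 1.0) -> str:
    """Return "sr" or "hr" from percentages of occurrences.

    Hypothetical rule: the text is labelled Serbian if the share of
    Serbian lexical units reaches the threshold, or if "V da V"
    constructions are more frequent than "V Vinf" ones.
    """
    if counts.tokens <= 0:
        raise ValueError("text has no tokens")
    lexical_pct = 100.0 * counts.serbian_lexical / counts.tokens
    v_da_v_pct = 100.0 * counts.v_da_v / counts.tokens
    v_vinf_pct = 100.0 * counts.v_vinf / counts.tokens
    serbian_votes = 0
    if lexical_pct >= lexical_threshold_pct:
        serbian_votes += 1
    if v_da_v_pct > v_vinf_pct:
        serbian_votes += 1
    return "sr" if serbian_votes > 0 else "hr"


print(classify(Counts(tokens=200, serbian_lexical=10, v_da_v=3, v_vinf=0)))
print(classify(Counts(tokens=200, serbian_lexical=0, v_da_v=0, v_vinf=4)))
```

Keeping the counts in one record makes it straightforward to write them out alongside the decision, as the final step on the slide requires.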

11 Output of processing
• Demo

12 Results
• Testing was performed on 2500 articles from the SETimes corpus
  • http://www.setimes.com/
  • texts in Serbian and Croatian
  • short news items translated from English
• The system achieved a precision of 99.82%
  • outperforming all known systems on this task
• 3 Serbian texts were misclassified as Croatian
  • texts with low recall on the considered criteria

13 Conclusion & future work
• NooJ and AutoHotkey in combination are sufficient even for very complex tasks
• The system is completely automated
• Disadvantage: AutoHotkey is highly dependent on screen resolution (automatic clicking)
• Future work:
  • There is room to improve the system
  • Take unknown words into account
  • Tune the system's voting
  • Create lists of "forbidden" words

14 Thank you for your attention!

