Language Identification of Web Data for Building Linguistic Corpora Marija Stupar, Tereza Jurić, Nikola Ljubešić Faculty of Humanities and Social Sciences.

1 Language Identification of Web Data for Building Linguistic Corpora
Marija Stupar, Tereza Jurić, Nikola Ljubešić
Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
INFuture2011: “Information Sciences and e-Society”, Zagreb, 10 November 2011

2 Overview
Introduction
Experimental setup
▫Languages observed
Methods used
▫Main approaches
▫Hybrid approaches
Results
▫Document level
▫Paragraph level
Conclusion
Stupar, Jurić, Ljubešić, Language Identification of Web Data for Building Linguistic Corpora

3 Introduction
The Web is a rich source of linguistic material
Such sources often contain more than one natural language
Aim: define a method for language identification of data collected from the Web
▫Comparison of two main and two hybrid approaches
Ultimate goal
▫Using Web resources as a basis for constructing corpora, specifically building hrWaC, the Croatian Web corpus

4 Experimental setup
Twelve languages observed
Table 1: A snippet from the Language Similarity Table (Scannell, 2007)
     cs  de  en  es  fr  hr  hu  it  pl  sk  sl  sv
cs    -  18  22  26  22  53  25  31  42  70  54  23
de   18   -  34  34  35  12  17  31  20  17  18  53
en   22  34   -  27  33  16  16  35  15  17  19  35
es   26  34  27   -  62  22  18  56  18  23  28  38
fr   22  35  33  62   -  18  15  48  15  18  22  35
hr   53  12  16  22  18   -  11  31  39  51  74  24
hu   25  17  16  18  15  11   -  14  10  22  13  21
it   31  31  35  56  48  31  14   -  22  28  38  32
pl   42  20  15  18  15  39  10  22   -  50  40  18
sk   70  17  17  23  18  51  22  28  50   -  55  22
sl   54  18  19  28  22  74  13  38  40  55   -  26
sv   23  53  35  38  35  24  21  32  18  22  26   -

5 Methods used
Main approaches
▫Function word distributions
▫Second-order Markov models
Hybrid approaches
▫Harmonic balance
▫Sophisticated method
Language identification on document and paragraph level
Table 2: Amount of data collected for each basic method
Language    Function words  Character count
Czech       210             150601
German      334             150156
English     230             150041
Spanish     217             150926
French      260             150083
Croatian    204             157366
Hungarian   223             152202
Italian     219             150459
Polish      268             150198
Slovak      168             150046
Slovenian   256             143841
Swedish     256             150762

6 Methods used – main approaches
Function word distributions
▫Lists of function words for all languages in question
▫The algorithm chooses the language for which the highest percentage of words in the text can be identified as function words of that language
Second-order Markov models
▫Conditional probability of a character given the two preceding characters; the distributions of character bigrams and trigrams are calculated on a training set
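The two main approaches above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function-word lists are toy placeholders rather than the lists from Table 2, and the names `function_word_scores` and `TrigramModel`, as well as the add-one smoothing, are illustrative choices the slides do not specify.

```python
import math
from collections import Counter

# Toy function-word lists (hypothetical; NOT the lists used in the study).
FUNCTION_WORDS = {
    "en": {"the", "of", "and", "to", "in", "is"},
    "hr": {"i", "u", "je", "se", "na", "da"},
}


def function_word_scores(text, lists=FUNCTION_WORDS):
    """Return, per language, the share of tokens that are its function words."""
    tokens = text.lower().split()
    if not tokens:
        return {lang: 0.0 for lang in lists}
    return {
        lang: sum(token in words for token in tokens) / len(tokens)
        for lang, words in lists.items()
    }


class TrigramModel:
    """Second-order character Markov model: P(c | two preceding characters).

    Add-one smoothing over the training alphabet is an assumption made here
    so that unseen trigrams do not get zero probability.
    """

    def __init__(self, training_text):
        padded = "  " + training_text.lower()
        positions = range(len(padded) - 2)
        self.trigrams = Counter(padded[i:i + 3] for i in positions)
        # Bigram counts serve as the context counts for the trigrams above.
        self.bigrams = Counter(padded[i:i + 2] for i in positions)
        self.alphabet_size = len(set(padded))

    def log_prob(self, text):
        """Log-probability of the text under the model."""
        padded = "  " + text.lower()
        total = 0.0
        for i in range(len(padded) - 2):
            trigram = padded[i:i + 3]
            numerator = self.trigrams[trigram] + 1
            denominator = self.bigrams[trigram[:2]] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total
```

In use, one `TrigramModel` would be trained per language and the language whose model assigns the highest `log_prob` would be chosen; for the function-word method, the language with the highest score from `function_word_scores` would be chosen.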

7 Methods used - hybrid approaches
Harmonic balance
▫Harmonic mean of the certainty of the function words method and the certainty of the Markov model method
▫Certainty is calculated as a/(a+b), where a is the best result and b the second-best result
Sophisticated hybrid method
▫Takes into account the strengths of each main method
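As a rough illustration of the certainty formula a/(a+b) and the harmonic mean, here is a small Python sketch. It assumes each method yields non-negative per-language scores (raw log-probabilities would need to be converted first); the function names and the example scores are hypothetical, and the slides do not say how the combined value is turned into a final decision, so that step is left out.

```python
def certainty(scores):
    """a / (a + b), where a is the best and b the second-best score.

    Assumes non-negative scores for at least two languages.
    """
    ranked = sorted(scores.values(), reverse=True)
    a, b = ranked[0], ranked[1]
    return a / (a + b) if a + b > 0 else 0.0


def harmonic_mean(x, y):
    """Harmonic mean of two certainties."""
    return 2 * x * y / (x + y) if x + y > 0 else 0.0


# Hypothetical scores from the two main methods for one text.
function_word_result = {"hr": 0.75, "sl": 0.25, "en": 0.0}
markov_result = {"hr": 0.5, "sl": 0.5, "en": 0.0}

combined = harmonic_mean(certainty(function_word_result),
                         certainty(markov_result))
```

The harmonic mean is pulled toward the lower of the two certainties, so the combined value is high only when both methods are confident.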

8 Methods used - hybrid approaches
Sophisticated hybrid method algorithm
▫If the Markov model method and the function words method give the same result, that result is accepted
▫If the results differ, but the second-best result of the Markov model method is identical to the best result of the function words method and its certainty is over 0.6, the result of the function words method is accepted
▫Otherwise the result of the Markov model method is accepted
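The three rules above can be written as a short decision function. This is a sketch under one explicit assumption: "its certainty" is read here as the certainty of the function words method, which the slide leaves ambiguous. The function and parameter names are hypothetical.

```python
def sophisticated_hybrid(markov_ranked, fw_best, fw_certainty, threshold=0.6):
    """Combine the two main methods following the three rules on the slide.

    markov_ranked: languages ordered best-first by the Markov model method.
    fw_best:       best language according to the function words method.
    fw_certainty:  certainty a/(a+b) of the function words method
                   (assumed reading of "its certainty").
    """
    markov_best = markov_ranked[0]
    # Rule 1: both methods agree.
    if markov_best == fw_best:
        return markov_best
    # Rule 2: function words winner is the Markov runner-up and is confident.
    if (len(markov_ranked) > 1
            and markov_ranked[1] == fw_best
            and fw_certainty > threshold):
        return fw_best
    # Rule 3: fall back to the Markov model method.
    return markov_best
```

For example, `sophisticated_hybrid(["sl", "hr", "cs"], "hr", 0.8)` follows the second rule and returns `"hr"`, while the same call with a certainty of 0.5 falls back to `"sl"`.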

9 Methods used - evaluation
Document level
▫20 documents per language
▫Documents containing less than 70% of any single language are considered unsolvable
Paragraph level
▫Paragraphs in 50 documents were labeled by the language they are written in
▫750 paragraphs in total
Evaluation measure is accuracy
▫(a+d)/(a+b+c+d)
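The accuracy measure can be checked with a few lines of Python. The slides do not define a, b, c and d; the sketch assumes the usual contingency-table reading in which a and d count correct decisions and b and c count incorrect ones. The example uses the document-level function words counts reported in the results (234 correct, 6 incorrect), and placing them in `a` and `b` is itself an assumption.

```python
def accuracy(a, b, c, d):
    """(a + d) / (a + b + c + d).

    a and d are assumed to count correct decisions, b and c incorrect ones.
    """
    total = a + b + c + d
    return (a + d) / total if total else 0.0


# 12 languages x 20 documents = 240 documents; 234 identified correctly.
print(round(accuracy(234, 6, 0, 0), 3))  # prints 0.975
```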

10 Results
Main approaches
Table 3: Results of the evaluation of the traditional approaches
           Document level                 Paragraph level
           Function words  Markov model   Function words  Markov model
Positive   234             239            745             747
Negative   6               1              5               3
Accuracy   0.975           0.996          0.993           0.996

11 Results
Hybrid approaches
Table 4: Results of the evaluation of hybrid methods
           Document level                            Paragraph level
           Harmonic balance  Sophisticated method   Harmonic balance  Sophisticated method
Positive   239               240                    746               747
Negative   1                 0                      4                 3
Accuracy   0.996             1.0                    0.995             0.996

12 Conclusion
The Markov model method outperforms the function words method
Hybrid approaches proved more effective at the document level (mixed-language content)
Power-law-like distribution of languages
Three languages account for 99% of the data
Around 96% of documents are written in only one language
▫4% have mixed content

