Web Page Language Identification Based on URLs Reporter: 鄭志欣 Advisor: Hsing-Kuo Pao 1.

1 Web Page Language Identification Based on URLs
Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao

2 Reference
- E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.

3 Outline
- Introduction
- Language Identification Based On URLs
- Experimental Setup
- Experimental Results
- Conclusions

4 Introduction
- Given only the URL of a web page, can we identify its language?
- Web crawlers
- Personalized web browsers
- We consider the problem of determining the language of a web page using only its URL.
- English, French, German, Spanish, and Italian
- .com (60%), .org (10%)
- www.wasserbett-test.com

5 Introduction
- Applying machine learning techniques
- Features
  - Word features
  - N-gram features
  - Custom-made features
- Machine learning algorithms
  - Naïve Bayes
  - Decision Tree
  - Relative Entropy
  - Maximum Entropy

7 Extracting Feature Vectors
- Words as features
- Remove special tokens such as "www", "index", "html", etc.
- For example, http://www.internetwordstats.com/africa2.htm is split into: internetwordstats, com, africa
- "cnn" and "gov" are indicative of English
- "produits" and "recherche" are indicative of French
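
The word-feature extraction on this slide can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `url_word_tokens` is hypothetical, and the stop list beyond "www", "index", and "html" (here "htm", "http", "https") is an assumption made so that the slide's example comes out as shown.

```python
import re

# "www", "index", "html" come from the slide; the other entries are assumptions.
STOP_TOKENS = {"www", "index", "html", "htm", "http", "https"}

def url_word_tokens(url):
    """Split a URL into lowercase alphabetic tokens and drop special tokens."""
    # Any run of non-letters (punctuation, digits) acts as a separator.
    tokens = re.split(r"[^a-z]+", url.lower())
    return [t for t in tokens if t and t not in STOP_TOKENS]

print(url_word_tokens("http://www.internetwordstats.com/africa2.htm"))
# ['internetwordstats', 'com', 'africa']
```

Splitting on digits as well as punctuation is what turns "africa2" into "africa" in the slide's example.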

8 Trigrams as features
- Start with the same tokens as in the method above (words as features)
- E.g., weather → "_we", "wea", "eat", "ath", "the", "her", "er_"
- "_th" and "ing" are very common in English
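
The trigram extraction for a single token can be sketched as below. The helper name `trigrams` is hypothetical; the underscore padding at both ends follows the "weather" example on the slide.

```python
def trigrams(token):
    """Return the character trigrams of a token, padded with '_' at both ends."""
    padded = f"_{token}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(trigrams("weather"))
# ['_we', 'wea', 'eat', 'ath', 'the', 'her', 'er_']
```

The padding lets the classifier distinguish trigrams at the start or end of a word (such as "_th") from the same letters in the middle of a word.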

9 Custom-made features
- Top-level domain country code
- OpenOffice dictionaries
- Dictionary with city names
- Number of hyphens
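
A rough sketch of how such custom-made features might be computed for one URL is given below. Everything specific here is an assumption for illustration: the function name `custom_features`, the feature names, and the tiny word and city sets standing in for the OpenOffice and city-name dictionaries. The slide does not say how the features were encoded.

```python
import re
from urllib.parse import urlsplit

def custom_features(url, dictionaries, cities):
    """Compute illustrative custom features for a URL.

    dictionaries and cities map a language name to a set of lowercase words;
    both are toy stand-ins for the real dictionaries (assumption).
    """
    if "//" not in url:
        url = "//" + url  # lets urlsplit recognise the host part
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    tokens = [t for t in re.split(r"[^a-z]+", url.lower()) if t]
    features = {
        # Two-letter top-level domains are country codes.
        "cctld": tld if len(tld) == 2 else "",
        "hyphens": url.count("-"),
    }
    for lang, words in dictionaries.items():
        features[f"dict_{lang}"] = sum(t in words for t in tokens)
    for lang, names in cities.items():
        features[f"city_{lang}"] = sum(t in names for t in tokens)
    return features

toy_dicts = {"German": {"wasserbett"}, "French": {"produits", "recherche"}}
toy_cities = {"German": {"berlin"}, "French": {"paris"}}
print(custom_features("www.wasserbett-test.com", toy_dicts, toy_cities))
```

For the URL from the introduction, the country-code feature is empty (the domain is .com), while the hyphen count and the German dictionary hit still provide evidence.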

10 Classification Algorithms
- Country code top-level domain only (ccTLD)
- Country code top-level domain plus (ccTLD+)
- Naïve Bayes (NB)
- Decision Trees (DT)
- Relative Entropy (RE)
- Maximum Entropy (ME)
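
The two top-level-domain baselines can be sketched as a simple lookup. This is only an illustration under stated assumptions: the slide does not list which domains map to which language, so the table `CCTLD_TO_LANGUAGE`, the set `GENERIC_ENGLISH_TLDS`, and the reading of ccTLD+ as "also treat generic domains as English" are all hypothetical here.

```python
from urllib.parse import urlsplit

# Illustrative mapping only (assumption, not taken from the slides).
CCTLD_TO_LANGUAGE = {
    "uk": "English",
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
}
# Assumed extension used by the "plus" variant.
GENERIC_ENGLISH_TLDS = {"com", "org"}

def cctld_baseline(url, plus=False):
    """Guess a language from the top-level domain, or return None."""
    if "//" not in url:
        url = "//" + url  # lets urlsplit recognise the host part
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld in CCTLD_TO_LANGUAGE:
        return CCTLD_TO_LANGUAGE[tld]
    if plus and tld in GENERIC_ENGLISH_TLDS:
        return "English"
    return None

print(cctld_baseline("http://www.example.de/"))
print(cctld_baseline("www.wasserbett-test.com"))
print(cctld_baseline("www.wasserbett-test.com", plus=True))
```

Under these assumptions the German-looking URL from the introduction is either left unclassified or labelled English, which is the kind of error the learned classifiers are meant to reduce.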

12 Data Sets
- The algorithms were evaluated on three different data sets:
- Open Directory Project
- Microsoft's Live Search
- 1,260 pages from a large web crawl, labeled by hand

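13
Data set                 Language   Training size   Test size
Open Directory Project   English    145,000         4,910
Open Directory Project   German     144,999         4,965
Open Directory Project   French     144,996         4,961
Open Directory Project   Spanish    144,974         4,878
Open Directory Project   Italian    144,987         4,933
Search Engine Results    English    99,992          999
Search Engine Results    German     99,572          992
Search Engine Results    French     99,549          997
Search Engine Results    Spanish    99,838          997
Search Engine Results    Italian    99,786          997
Web Crawl                English    0               1,082
Web Crawl                German     0               81
Web Crawl                French     0               57
Web Crawl                Spanish    0               19
Web Crawl                Italian    0               21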

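15
- Precision: $P = \dfrac{n_{+}\, p(+\mid+)}{n_{+}\, p(+\mid+) + n_{-}\,\bigl(1 - p(-\mid-)\bigr)}$
- Recall: $R = p(+\mid+)$
- $p(-\mid-)$: fraction of negative test examples classified as negative
- F-measure: $F = \dfrac{2}{1/R + 1/P}$
- Here $n_{+}$ and $n_{-}$ are the numbers of positive and negative test examples.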
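
The formulas on this slide can be checked with a small helper. The function name and argument names are hypothetical, the numbers in the usage line are made up purely to exercise the arithmetic, and the sketch assumes at least one example is predicted positive and that recall is non-zero.

```python
def precision_recall_f(n_pos, n_neg, p_pos_pos, p_neg_neg):
    """Compute P, R and F from class sizes and per-class success rates.

    n_pos, n_neg: numbers of positive and negative test examples.
    p_pos_pos: p(+|+), fraction of positives classified as positive.
    p_neg_neg: p(-|-), fraction of negatives classified as negative.
    """
    true_pos = n_pos * p_pos_pos
    false_pos = n_neg * (1.0 - p_neg_neg)
    precision = true_pos / (true_pos + false_pos)
    recall = p_pos_pos
    f_measure = 2.0 / (1.0 / recall + 1.0 / precision)
    return precision, recall, f_measure

# Made-up inputs: 100 positives, 400 negatives.
p, r, f = precision_recall_f(100, 400, 0.90, 0.95)
print(round(p, 3), round(r, 3), round(f, 3))
```

Because precision depends on the ratio of negative to positive examples, the same classifier yields different precision values on test sets with different class balance, while p(+|+) and p(-|-) stay comparable.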

16 Human Performance

17 Baseline: ccTLD

22 Conclusions
- This paper shows that high-quality language identifiers for web pages can be built based on URLs alone.
- The largest challenge is to identify English-looking URLs of non-English web pages.

