Published by Jessica Clark. Modified over 9 years ago.

1
Web Page Language Identification Based on URLs Reporter: 鄭志欣 Advisor: Hsing-Kuo Pao 1
Reference
E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.

Outline
- Introduction
- Language Identification Based On URLs
- Experimental Setup
- Experimental Results
- Conclusions

Introduction
- Given only the URL of a web page, can we identify its language?
  - Web crawlers
  - Personalized web browser
- We consider the problem of determining the language of a web page using only its URL.
  - English, French, German, Spanish, and Italian
  - .com (60%), .org (10%)
  - www.wasserbett-test.com

Introduction
- Applying machine learning techniques
- Features
  - Word features
  - N-gram features
  - Custom-made features
- Machine learning algorithms
  - Naïve Bayes
  - Decision Tree
  - Relative Entropy
  - Maximum Entropy

Extracting Feature Vectors
- Words as features
  - Remove "www", "index", "html", etc.
  - For example, http://www.internetwordstats.com/africa2.htm is split into: internetwordstats, com, africa
  - "cnn" and "gov" are indicative of English
  - "produits" and "recherche" are indicative of French

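The word-feature step can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `url_tokens` and the exact stop list are assumptions (the slide names only "www", "index", and "html"; "htm", "http", and "https" are added here so the example URL reproduces the split shown on the slide).

```python
import re

# Assumed stop list: the slide names "www", "index", "html";
# the remaining entries are illustrative additions.
STOP_TOKENS = {"www", "index", "html", "htm", "http", "https"}

def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens and drop stop tokens."""
    # Any run of non-letters (punctuation, digits, slashes) is a separator.
    tokens = re.split(r"[^a-z]+", url.lower())
    return [t for t in tokens if t and t not in STOP_TOKENS]

print(url_tokens("http://www.internetwordstats.com/africa2.htm"))
# → ['internetwordstats', 'com', 'africa']
```

Treating digits as separators is what turns "africa2" into "africa" in this sketch.
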
Trigrams as features
- Start with the same tokens as in the method above (words as features)
- E.g., weather: "_we", "wea", "eat", "ath", "the", "her", "er_"
- "_th" and "ing" are very common in English

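The trigram extraction for a single token can be sketched as below. The helper name `trigrams` is hypothetical; the underscore padding follows the "weather" example on the slide.

```python
def trigrams(token):
    """Return the character trigrams of a token, padded with '_' on both sides."""
    padded = f"_{token}_"
    # Slide a window of width 3 across the padded token.
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(trigrams("weather"))
# → ['_we', 'wea', 'eat', 'ath', 'the', 'her', 'er_']
```
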
Custom-made features
- Country code of the top-level domain
- OpenOffice dictionaries
- Dictionary with city names
- Number of hyphens

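A rough sketch of how such custom features might be computed for one URL is given below. Everything here is illustrative: the function name, the feature names, and the two tiny word sets are hypothetical stand-ins for the OpenOffice and city-name dictionaries, whose contents the slides do not show.

```python
from urllib.parse import urlparse

# Hypothetical miniature dictionaries, for illustration only.
GERMAN_WORDS = {"wasserbett"}
CITY_NAMES = {"berlin", "paris"}

def custom_features(url, tokens):
    """Compute a few hand-crafted features from a URL and its tokens."""
    host = urlparse(url).hostname or ""
    return {
        "tld": host.rsplit(".", 1)[-1],                        # last label of the host name
        "hyphens": url.count("-"),                             # number of hyphens in the URL
        "dict_hits": sum(t in GERMAN_WORDS for t in tokens),   # tokens found in the word list
        "city_hits": sum(t in CITY_NAMES for t in tokens),     # tokens that are city names
    }

print(custom_features("http://www.wasserbett-test.com", ["wasserbett", "test", "com"]))
```

In practice there would be one dictionary-hit feature per language; a single German list is used here only to keep the sketch short.
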
Classification Algorithms
- Country code top-level domain only (ccTLD)
- Country code top-level domain plus (ccTLD+)
- Naïve Bayes (NB)
- Decision Trees (DT)
- Relative Entropy (RE)
- Maximum Entropy (ME)

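The ccTLD baseline can be illustrated with a small lookup. The mapping below is an assumption made for this sketch (the slides do not list which country codes map to which language), and the additional rules of the ccTLD+ variant are not given here, so only the plain baseline is shown.

```python
from urllib.parse import urlparse

# Assumed mapping from country-code TLD to language; illustrative only.
CCTLD_TO_LANGUAGE = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}

def cctld_baseline(url):
    """Guess the language from the country-code TLD; None if the TLD is unmapped."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_TO_LANGUAGE.get(tld)

print(cctld_baseline("http://www.example.de/"))          # → German
print(cctld_baseline("http://www.wasserbett-test.com"))  # → None
```

The second call shows the weakness of this baseline: a German-looking URL under .com gets no prediction at all.
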
Data Sets
- The algorithms were evaluated on three different data sets:
  - Open Directory Project
  - Microsoft's Live Search
  - 1260 pages from a large web crawl, labeled by hand

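| Data set | Language | Training size | Test size |
|---|---|---|---|
| Open Directory Project | English | 145,000 | 4910 |
| Open Directory Project | German | 144,999 | 4965 |
| Open Directory Project | French | 144,996 | 4961 |
| Open Directory Project | Spanish | 144,974 | 4878 |
| Open Directory Project | Italian | 144,987 | 4933 |
| Search Engine Results | English | 99,992 | 999 |
| Search Engine Results | German | 99,572 | 992 |
| Search Engine Results | French | 99,549 | 997 |
| Search Engine Results | Spanish | 99,838 | 997 |
| Search Engine Results | Italian | 99,786 | 997 |
| Web Crawl | English | 0 | 1082 |
| Web Crawl | German | 0 | 81 |
| Web Crawl | French | 0 | 57 |
| Web Crawl | Spanish | 0 | 19 |
| Web Crawl | Italian | 0 | 21 |
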
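Precision P, recall R, and F-measure F, where n_+ and n_- are the numbers of positive and negative test examples, p(+|+) is the fraction of positive examples classified as positive, and p(-|-) is the fraction of negative examples classified as negative:

\[ P = \frac{n_+ \, p(+|+)}{n_+ \, p(+|+) + n_- \, (1 - p(-|-))} \]

\[ R = p(+|+) \]

\[ F = \frac{2}{1/R + 1/P} \]
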
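As a small worked example of these measures, the sketch below computes precision and the F-measure from the two per-class success rates. The function and parameter names are chosen for illustration, and the input numbers are made up, not results from the paper.

```python
def precision(n_pos, n_neg, p_pp, p_nn):
    """Precision from class sizes and per-class success rates.

    p_pp stands for p(+|+), p_nn for p(-|-).
    """
    true_pos = n_pos * p_pp            # positives correctly accepted
    false_pos = n_neg * (1.0 - p_nn)   # negatives wrongly accepted
    return true_pos / (true_pos + false_pos)

def f_measure(p, r):
    """Harmonic mean of precision p and recall r (both must be non-zero)."""
    return 2.0 / (1.0 / r + 1.0 / p)

# Made-up numbers: 100 positives, 100 negatives, p(+|+) = 0.9, p(-|-) = 0.8.
p = precision(100, 100, 0.9, 0.8)
r = 0.9  # recall equals p(+|+)
print(round(p, 4), round(f_measure(p, r), 4))
```

With equally sized classes the counts cancel and precision reduces to p(+|+) / (p(+|+) + 1 - p(-|-)).
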
Human Performance

Baseline: ccTLD

Conclusions
- This paper shows that high-quality language identifiers for web pages can be built based on URLs alone.
- The largest challenge is to identify English-looking URLs of non-English web pages.