Language Identification in Web Pages
Bruno Martins, Mário J. Silva
Faculdade de Ciências da Universidade de Lisboa
ACM SAC 2005 DOCUMENT ENGINEERING TRACK (DE-ACM-SAC-2005)
Motivation
● Goal: Efficiently crawl web pages in a given language, Portuguese in our case.
● Need to accurately distinguish one language from others.
We take an n-gram-based approach to this problem, which has been reported to give excellent results.
Problems
● Web texts are considerably different:
– Multilingual documents.
– Spelling errors.
– Lack of coherent sentences.
– Often small amounts of textual data.
These differences motivate revisiting the problem.
Outline
● Introduction.
● Context and Related Work.
– Language identification.
– Text categorization with n-grams.
● Our Language Identification Algorithm.
● Experimental Results.
● Future Work.
● Conclusions.
Language Identification
● Sibun and Reynar provided a good survey.
● A variety of features has been tried:
– Characters, words, POS tags, n-grams,...
N-gram-based methods seem to be the most promising.
● Dunning, Damashek, Cavnar & Trenkle,...
N-grams in text categorization
N-grams = n-character slices of a longer string.
● “tumba!” is composed of the following n-grams:
– Unigrams: _, t, u, m, b, a, !, _
– Bigrams: _t, tu, um, mb, ba, a!, !_
– Trigrams: _tu, tum, umb, mba, ba!, a!_, !__
– Quadgrams: _tum, tumb, umba, mba!, ba!_, a!__, !___
– Quintgrams: _tumb, tumba, umba!, mba!_, ba!__, a!___, !____
● Advantages:
– Efficiently handle spelling and grammatical errors.
– No need for tokenization, stemming,...
– Computationally and space efficient.
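The slicing on this slide can be reproduced in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `char_ngrams` and the padding rule (one leading underscore and max(1, n-1) trailing underscores) are assumptions, chosen so that the output matches the “tumba!” listing above.

```python
def char_ngrams(token, n, pad="_"):
    """Return the character n-grams of `token`, padded as on the slide.

    Assumed padding rule: one leading pad character and max(1, n - 1)
    trailing pad characters. This reproduces the "tumba!" example.
    """
    padded = pad + token + pad * max(1, n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# char_ngrams("tumba!", 3) -> ['_tu', 'tum', 'umb', 'mba', 'ba!', 'a!_', '!__']
```

Because the function only slides a fixed-width window over the padded string, a misspelled word still shares most of its n-grams with the correct form, which is the robustness the slide refers to.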
Outline
● Introduction.
● Context and Related Work.
● Our Language Identification Algorithm.
– N-gram categorization approach.
– Measuring similarity with n-gram profiles.
– Heuristics for Web documents.
● Experimental Results.
● Future Work.
● Conclusions.
N-gram categorization approach
● Measure similarity among documents through n-gram statistics.
● Use n-grams of multiple lengths simultaneously (1 to 5 characters).
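Collecting n-gram statistics for lengths 1 to 5 at once could look roughly like the sketch below. The names `ngram_counts` and `ranked_profile`, the whitespace normalization, and the `top_k` cut-off are all assumptions made for illustration; the slides do not say how the profiles were actually built.

```python
from collections import Counter


def ngram_counts(text, min_n=1, max_n=5):
    """Count character n-grams of every length from min_n to max_n.

    Assumption: runs of whitespace become a single "_" and the whole
    text is wrapped in "_", so no word-level tokenization is needed.
    """
    s = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1
    return counts


def ranked_profile(counts, top_k=400):
    """Most frequent n-grams first; ties are broken alphabetically.

    top_k is a hypothetical cut-off, not a value taken from the slides.
    """
    return sorted(counts, key=lambda g: (-counts[g], g))[:top_k]
```

A language profile would then be `ranked_profile(ngram_counts(training_text))`, and a document profile is built the same way from the page text.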
N-gram similarity - Cavnar & Trenkle
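The figure for this slide is not available here. As a reminder of the idea, Cavnar & Trenkle compare rank-ordered n-gram profiles with an "out-of-place" measure: each n-gram in the document profile contributes the distance between its rank there and its rank in the category profile, and n-grams missing from the category profile contribute a fixed maximum. The sketch below assumes that maximum is the length of the category profile; the name `out_of_place` and the `max_penalty` parameter are illustrative.

```python
def out_of_place(doc_profile, cat_profile, max_penalty=None):
    """Sum of rank differences between two ranked n-gram profiles.

    Both profiles are lists ordered from most to least frequent.
    Lower totals mean the document is closer to the category.
    """
    if max_penalty is None:
        # Assumed penalty for n-grams absent from the category profile.
        max_penalty = len(cat_profile)
    cat_rank = {gram: rank for rank, gram in enumerate(cat_profile)}
    total = 0
    for rank, gram in enumerate(doc_profile):
        if gram in cat_rank:
            total += abs(rank - cat_rank[gram])
        else:
            total += max_penalty
    return total
```

The document is assigned to the language whose profile yields the smallest total.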
More efficient similarity measures
● Lin's information-theoretic similarity measure:
● Jiang and Conrath's distance formula:
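The formulas on this slide did not survive extraction. In their general form, Lin's measure is sim(A, B) = 2 log P(common(A, B)) / (log P(A) + log P(B)), and Jiang and Conrath's distance is dist(A, B) = 2 log P(common(A, B)) - (log P(A) + log P(B)). The sketch below applies both to sets of n-grams under loudly stated assumptions: each n-gram is treated as an independent feature, log P of a set is the sum of the log probabilities of its n-grams, and `background` is a hypothetical dictionary of n-gram probabilities estimated from training data. The exact formulation used by the authors may differ.

```python
import math


def _log_prob_sum(grams, background):
    # N-grams with no background estimate are skipped (an assumption).
    return sum(math.log(background[g]) for g in grams if g in background)


def lin_similarity(doc_grams, cat_grams, background):
    """Lin-style similarity; 1.0 for identical n-gram sets, 0.0 for disjoint ones."""
    doc, cat = set(doc_grams), set(cat_grams)
    denom = _log_prob_sum(doc, background) + _log_prob_sum(cat, background)
    if denom == 0:
        return 0.0
    return 2 * _log_prob_sum(doc & cat, background) / denom


def jiang_conrath_distance(doc_grams, cat_grams, background):
    """Jiang-Conrath-style distance; 0.0 for identical n-gram sets."""
    doc, cat = set(doc_grams), set(cat_grams)
    common = _log_prob_sum(doc & cat, background)
    return 2 * common - (_log_prob_sum(doc, background)
                         + _log_prob_sum(cat, background))
```

With Lin's measure the best language is the one with the highest similarity; with the distance it is the one with the lowest value.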
Heuristics for the Web
● Use meta-data information, if available and valid.
– Matching strings on the language meta tag.
● Filter common or automatically generated strings.
– “optimized for Internet Explorer”
● Weight n-grams according to HTML markup.
– Title, bold typeface, subject and description meta tags.
● Handle insufficient data.
– Ignore pages with fewer than 40 characters.
● Handle multilingualism and hard-to-decide cases.
– Weight the longest sentences.
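Three of the heuristics above (the meta tag check, the boilerplate filter with markup weights, and the 40-character minimum) can be sketched as a small preprocessing step. Only the 40-character threshold and the “optimized for Internet Explorer” phrase come from the slide. The function names, the weight values, and the idea of passing pre-extracted `(markup, text)` segments are hypothetical choices for illustration.

```python
MIN_CHARS = 40  # threshold from the slide
BOILERPLATE = ("optimized for internet explorer",)  # example from the slide
MARKUP_WEIGHTS = {"title": 3, "meta": 2, "bold": 2, "body": 1}  # hypothetical weights


def language_from_meta(meta_value, known_codes):
    """Return a language code declared in a meta tag, or None if unusable."""
    if not meta_value:
        return None
    code = meta_value.strip().lower().replace("_", "-").split("-")[0]
    return code if code in known_codes else None


def filter_and_weight(segments):
    """Clean (markup, text) segments and attach a weight to each.

    Returns a list of (cleaned_text, weight) pairs, or None when the
    page has fewer than MIN_CHARS characters left after filtering.
    """
    kept = []
    for markup, text in segments:
        lowered = text.lower()
        for phrase in BOILERPLATE:
            lowered = lowered.replace(phrase, " ")
        cleaned = " ".join(lowered.split())
        if cleaned:
            kept.append((cleaned, MARKUP_WEIGHTS.get(markup, 1)))
    if sum(len(text) for text, _ in kept) < MIN_CHARS:
        return None
    return kept
```

The weights would then multiply the n-gram counts of each segment before the profiles are compared, so that n-grams from the title count more than n-grams from the body.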
Evaluation Experiments
● Language profiles for 23 different languages.
● Test collection: 500 documents for each of 12 different languages.
● HTML documents crawled from portals and online newspapers.
● Tested the classification algorithm in different settings.
● Lin's measure was the most accurate.
● Heuristics improve performance.
Evaluation Results
Application to the Portuguese Web
● About 3.5 million pages.
● Multiple file types.
● A significant portion of the Portuguese Web is written in foreign languages, especially English.
Limitations
● Unable to distinguish dialects of the same language?
– Portuguese from Portugal and from Brazil.
– English and American English?
● Possible directions:
– Web linkage information.
– “Discriminative” n-grams instead of most frequent.
Future Work
● Carefully choose better training data.
● Smoothing (Good-Turing).
● Use the n-gram approach for other classification tasks.
Conclusions
● N-grams are effective in language guessing.
● Text from the Web presents problems.
● Lin's similarity measure seems effective.
Thanks for your attention!