Automatic Language Identification – A Syntactic Approach


1 Automatic Language Identification – A Syntactic Approach
Mahesh Soundalgekar November 23, 2018 CFILT, IIT Bombay

2 The Road Map
Introduction
System Architecture
Classification Approaches
Experimental Results
Summary and Future Work

3 Introduction
Goal: Efficiently crawl Web pages in a given language (Marathi in our case)
Different languages use the same Devanagari script, e.g. Marathi, Sanskrit and Hindi
It is therefore necessary to accurately distinguish one language from the others
We take a syntactic approach to this problem, which has given us excellent results with 2 MB of training data and 10 MB of test data

4 System Architecture
HTML Documents in different encodings such as Xdvng, DV-TTYogesh
-> HTML to ASCII
-> Plain Text + Font Information
-> Appropriate Encoding Converter
-> Plain Text in ISCII Encoding
-> Classifier
-> Classification Results

5 Classification Approaches
Most Frequently Occurring Common Words
  e.g. English: the, an, is, at, a, etc.
N-Grams (Most Frequent Character Sequences)
  Bi-grams: th, 's, re, en
  Tri-grams: the, ing, ion
  Quad-grams: tion, as in classification, association, gratification, etc.

6 Important Factors
Size of the Training Data: important to capture the syntactic essence of a language
Domains of Training Data: usage varies from domain to domain and from author to author
Size of the Test Data: small test data may not contain enough information for classification
The common words approach requires linguistic knowledge

7 Classifier Architecture
Training Samples -> Generate Profiles -> Category Profiles
Test Document -> Generate Profile -> Document Profile
Category Profiles + Document Profile -> Measure Profile Distances -> Find Minimum Distance -> Identify Category
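
The flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `build_profile`, `missing_count` and `classify` are hypothetical, the profile is a frequency-ranked list of character bi-grams, and `missing_count` is a deliberately simple placeholder distance that can be swapped for any other profile distance.

```python
from collections import Counter

def build_profile(text, n=2, top_k=300):
    """Rank the character n-grams of `text` from most to least frequent."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    # Sort by descending count; break ties alphabetically so results are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]

def missing_count(doc_profile, category_profile):
    """Placeholder distance (an assumption): document n-grams absent from the category."""
    known = set(category_profile)
    return sum(1 for gram in doc_profile if gram not in known)

def classify(document, training_samples, distance=missing_count):
    """Return the category whose profile has the minimum distance to the document."""
    # Training Samples -> Generate Profiles -> Category Profiles
    category_profiles = {
        lang: build_profile(text) for lang, text in training_samples.items()
    }
    # Test Document -> Generate Profile -> Document Profile
    doc_profile = build_profile(document)
    # Measure Profile Distances -> Find Minimum Distance -> Identify Category
    return min(
        category_profiles,
        key=lambda lang: distance(doc_profile, category_profiles[lang]),
    )
```

For example, `classify("the rat sat", {"en": "the cat sat on the mat and the hat", "xx": "zzz qqq zzz qqq xxz"})` picks `"en"`, because far fewer of the document's bi-grams are missing from that profile.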

8 Common Words Approach
A list of selected common words for each language is matched against the test document
The closest match gives the language of the document
Advantages:
  Intuitive
  Computationally efficient
  Space efficient
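
A minimal sketch of common-word matching follows. The slide does not say how "closest match" is scored, so the scoring rule here (the fraction of tokens found in a language's word list) is an assumption, and the function names are hypothetical. The English list comes from the earlier slide; the German list is included purely for illustration.

```python
def common_word_score(text, common_words):
    """Fraction of whitespace-separated tokens that appear in the common-word list."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in common_words)
    return hits / len(tokens)

def identify_by_common_words(text, word_lists):
    """Return the language whose common-word list matches the text best."""
    return max(word_lists, key=lambda lang: common_word_score(text, word_lists[lang]))

# Illustrative word lists; a real system would use hand-picked lists per language.
WORD_LISTS = {
    "english": {"the", "an", "is", "at", "a"},
    "german": {"der", "die", "und", "ist", "das"},
}
```

With these lists, `identify_by_common_words("the cat is at the door", WORD_LISTS)` selects `"english"`, since four of the six tokens are in the English list and none are in the German one. Only a set lookup per token is needed, which is why the approach is cheap in both time and space.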

9 Top 5 Marathi Common Words
व
आणि
आहे
या
ते

10 N-Grams Approach
JAVA
  Bi-grams: _J, JA, AV, VA, A_
  Tri-grams: _JA, JAV, AVA, VA_, A__
  Quad-grams: _JAV, JAVA, AVA_, VA__, A___
मदत
  Bi-grams: _म, मद, दत, त_
  Tri-grams: _मद, मदत, दत_, त__
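
The padding pattern in the JAVA example (one leading underscore and n-1 trailing underscores) can be reproduced with a short function. This is a sketch inferred from the n-grams listed on the slide; the name `padded_ngrams` is hypothetical.

```python
def padded_ngrams(word, n, pad="_"):
    """Return the character n-grams of `word`, padded as in the JAVA example."""
    # One pad character in front, n - 1 behind, so the last letter also starts an n-gram.
    padded = pad + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

Calling `padded_ngrams("JAVA", 2)`, `padded_ngrams("JAVA", 3)` and `padded_ngrams("JAVA", 4)` yields the bi-grams, tri-grams and quad-grams listed on the slide.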

11 Measuring Distances
Out_of_Place()
Category profile (sorted in descending order): A, ER, ING, AND, ON
Test profile (sorted in descending order): AR, AND, ER, ED, ON
Out-of-place value for each test n-gram:
  AR: max_value (not in the category profile)
  AND: 2
  ER: 1
  ED: max_value (not in the category profile)
  ON: 0
Distance = 3 + 2 * max_value
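
The out-of-place measure can be written out as a short function. The sketch below reads the slide's figure as a category profile of A, ER, ING, AND, ON and a test profile of AR, AND, ER, ED, ON (an interpretation of the diagram); the function name and the choice of `max_value = 5` in the usage note are assumptions.

```python
def out_of_place(test_profile, category_profile, max_value):
    """Sum, over test n-grams, of how far each one's rank is from its category rank."""
    rank_in_category = {gram: rank for rank, gram in enumerate(category_profile)}
    total = 0
    for rank, gram in enumerate(test_profile):
        if gram in rank_in_category:
            # Present in both profiles: penalty is the rank displacement.
            total += abs(rank - rank_in_category[gram])
        else:
            # Absent from the category profile: fixed maximum penalty.
            total += max_value
    return total

CATEGORY_PROFILE = ["A", "ER", "ING", "AND", "ON"]
TEST_PROFILE = ["AR", "AND", "ER", "ED", "ON"]
```

With `max_value = 5`, `out_of_place(TEST_PROFILE, CATEGORY_PROFILE, 5)` gives 13, which agrees with the slide's Distance = 3 + 2 * max_value: AND contributes 2, ER contributes 1, ON contributes 0, and the two unseen n-grams contribute max_value each.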

12 Extensions to N-Grams Method
Lowest Granularity: आदित्य = अ + ा + ि + द + त + य
Letter Granularity: आदित्य = आ + दि + त + य
Conjunct Granularity: आदित्य = आ + दि + त्य
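
The three granularities can be approximated on Unicode text as follows. This is a sketch under assumptions rather than the slides' method: the original system works on ISCII-encoded text, whereas this version uses Unicode code points as the lowest level, treats any combining mark as part of the preceding letter, and joins consonants across a virama (halant) for the conjunct level. The function name `segment` and the granularity labels are hypothetical.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama (halant), which links consonants in a conjunct

def segment(word, granularity):
    """Split a Devanagari word into units at 'atomic', 'letter' or 'conjunct' granularity."""
    if granularity == "atomic":
        # Lowest level in this sketch: one unit per Unicode code point.
        return list(word)
    units = []
    for ch in word:
        # Vowel signs, virama, anusvara etc. are combining marks (categories Mn/Mc).
        is_mark = unicodedata.category(ch).startswith("M")
        # At conjunct granularity a consonant following a virama stays in the same unit.
        joins_conjunct = (
            granularity == "conjunct" and bool(units) and units[-1].endswith(VIRAMA)
        )
        if units and (is_mark or joins_conjunct):
            units[-1] += ch
        else:
            units.append(ch)
    return units

WORD = "\u0906\u0926\u093f\u0924\u094d\u092f"  # the example word, spelled out by code point
```

For the example word, `segment(WORD, "letter")` produces four units and `segment(WORD, "conjunct")` produces three, because the final consonant cluster is kept together. N-grams are then built over these units instead of over single characters.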

13 Experimental Training Setup
Language | Total size of pages (KB) | No. of pages | Average size of a page (KB)
Marathi | 700 | 46 | 15.2
Hindi | 600 | 24 | 25
Sanskrit | 560 | 19 | 29.5

14 Category Profiles Generated through Training
Language | No. of handpicked Common Words | No. of N-Grams (Atomic Approach) | No. of N-Grams (Letter Approach) | No. of N-Grams (Conjunct Approach)
Marathi | 25 | 37633 | 63596 | 63580
Hindi | (missing) | 15450 | 26886 | 26865
Sanskrit | 21 | 24119 | 45380 | 49368

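15 Classification Results
Columns: Language | Common Words | Atomic Approach | Letter Approach | Conjunct Approach
Marathi | 91% | 95% | 100%
Hindi | 93% | 80% | 92%
Sanskrit | 86% | 50%
(Some cells did not survive in the transcript, so the values are listed in their original order and may not line up with the column headers.)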

16 Summary and Future Work
Good results have been obtained through syntactic classification
The common words technique is computationally the most efficient, but less accurate
Our extensions to N-Grams give the desired accuracy
The N-Grams technique is robust to syntax errors
The N-Grams technique does not require linguistic knowledge
We will use language identification techniques to identify a good starting set of pages for the crawling activities of a general-purpose search engine

