Automatic Language Identification – A Syntactic Approach

Presentation transcript:

Automatic Language Identification – A Syntactic Approach Mahesh Soundalgekar November 23, 2018 CFILT, IIT Bombay

The Road Map
- Introduction
- System Architecture
- Classification Approaches
- Experimental Results
- Summary and Future Work

Introduction
- Goal: efficiently crawl Web pages in a given language (Marathi in our case).
- Several languages share the Devanagari script, e.g. Marathi, Sanskrit and Hindi.
- It is therefore necessary to distinguish one language from the others accurately.
- We take a syntactic approach to this problem, which has given excellent results with 2 MB of training data and 10 MB of test data.

System Architecture
1. HTML documents in different encodings such as Xdvng, DV-TTYogesh
2. HTML to ASCII
3. Plain text + font information
4. Appropriate encoding converter
5. Plain text in ISCII encoding
6. Classifier
7. Classification results

Classification Approaches
- Most frequently occurring common words, e.g. English: the, an, is, at, a
- N-grams (most frequent character sequences)
  - Bi-grams: th, 's, re, en
  - Tri-grams: the, ing, ion
  - Quad-grams: tion, as in classification, association, gratification

Important Factors
- Size of the training data: it must be large enough to capture the syntactic essence of a language.
- Domains of the training data: usage varies from domain to domain and from author to author.
- Size of the test data: a small test document may not contain enough information for classification.
- Linguistic knowledge: the common words approach requires it.

Classifier Architecture
- Training: Training samples -> Generate profiles -> Category profiles
- Testing: Test document -> Generate profile -> Document profile
- Classification: Measure profile distances -> Find minimum distance -> Identify category
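The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `top_k` cut-off, the choice of bi-grams and tri-grams, and the stand-in distance `missing_count` are all assumptions made here. Any profile distance can be passed in as the `distance` argument.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_profile(text, n_values=(2, 3), top_k=300):
    """Rank n-grams by descending frequency and keep the top_k (assumed cut-off)."""
    counts = Counter()
    for n in n_values:
        counts.update(char_ngrams(text, n))
    # most_common() sorts by descending count.
    return [gram for gram, _ in counts.most_common(top_k)]


def classify(doc_profile, category_profiles, distance):
    """Return the category whose profile is closest to the document profile."""
    return min(
        category_profiles,
        key=lambda category: distance(category_profiles[category], doc_profile),
    )


def missing_count(category_profile, doc_profile):
    """Stand-in distance (an assumption): document n-grams absent from the category."""
    known = set(category_profile)
    return sum(1 for gram in doc_profile if gram not in known)


profiles = {
    "en": build_profile("the cat and the hat and the bat"),
    "de": build_profile("der hund und die katze und der vogel"),
}
document = build_profile("the rat and the mat")
print(classify(document, profiles, missing_count))
```

Keeping the distance as a parameter mirrors the diagram, where profile generation and distance measurement are separate stages.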

Common Words Approach
- A list of selected common words is kept for each language.
- Each list is matched against the test document.
- The closest match gives the language of the document.
- Advantages:
  - Intuitive
  - Computationally efficient
  - Space efficient
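A small sketch of the matching step, under stated assumptions: the English list is the one given earlier in the slides, the German list is added here purely for illustration, and "closest match" is interpreted as the highest fraction of tokens found in a language's list. The slides do not specify the scoring rule, so that interpretation is an assumption.

```python
COMMON_WORDS = {
    "english": {"the", "an", "is", "at", "a"},       # list from the slides
    "german": {"der", "die", "und", "ist", "das"},   # illustrative assumption
}


def common_word_score(tokens, common_words):
    """Fraction of tokens that appear in the given common-word list."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in common_words)
    return hits / len(tokens)


def identify_language(text, word_lists=COMMON_WORDS):
    """Pick the language whose common words cover the largest share of the text."""
    tokens = text.lower().split()
    return max(
        word_lists,
        key=lambda language: common_word_score(tokens, word_lists[language]),
    )


print(identify_language("The cat is at the door"))
print(identify_language("Der Hund ist das Tier und die Katze"))
```

Only a handful of words per language are stored and each token needs a single set lookup, which is where the space and time efficiency comes from. On a tie, `max` returns the first language in the dictionary.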

Top 5 Marathi Common Words
- व
- आणि
- आहे
- या
- ते

N-Grams Approach
JAVA
- Bi-grams: _J, JA, AV, VA, A_
- Tri-grams: _JA, JAV, AVA, VA_, A__
- Quad-grams: _JAV, JAVA, AVA_, VA__, A___
मदत
- Bi-grams: _म, मद, दत, त_
- Tri-grams: _मद, मदत, दत_, त__
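The padding pattern in the JAVA example (one leading underscore, n - 1 trailing underscores, one n-gram per character of the word plus one) can be reproduced with the sketch below. The function name `padded_ngrams` is hypothetical, and the code works on Unicode code points rather than on the ISCII text used in the system.

```python
def padded_ngrams(word, n, pad="_"):
    """Return the n-grams of `word` with one leading and n - 1 trailing pad characters."""
    padded = pad + word + pad * (n - 1)
    # One n-gram starts at the leading pad and one at each character
    # except the last trailing position, giving len(word) + 1 n-grams.
    return [padded[i:i + n] for i in range(len(word) + 1)]


for n in (2, 3, 4):
    print(n, padded_ngrams("JAVA", n))
```

The same call on a Devanagari word made only of base consonants, such as मदत, yields one n-gram per code point in the same way.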

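Measuring Distances: Out_of_Place()
Both profiles are sorted in descending order of frequency. An n-gram of the test profile contributes the difference between its ranks in the two profiles, or max_value if it does not occur in the category profile.

Rank  Category profile  Test profile  Out of place
1     A                 AR            max_value
2     ER                AND           2
3     ING               ER            1
4     AND               ED            max_value
5     ON                ON            0

Distance = 3 + 2 * max_value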
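A minimal sketch of the out-of-place measure follows, using the two five-entry profiles from this slide (category: A, ER, ING, AND, ON; test: AR, AND, ER, ED, ON). The slides do not say how max_value is chosen, so the default used here (the length of the category profile) is an assumption, as is the function name.

```python
def out_of_place(category_profile, test_profile, max_value=None):
    """Sum of rank differences between two frequency-ranked n-gram profiles."""
    if max_value is None:
        # Assumed default: the slides leave max_value unspecified.
        max_value = len(category_profile)
    rank_in_category = {gram: rank for rank, gram in enumerate(category_profile)}
    total = 0
    for rank, gram in enumerate(test_profile):
        if gram in rank_in_category:
            total += abs(rank - rank_in_category[gram])
        else:
            # N-gram unseen in the category profile: maximum penalty.
            total += max_value
    return total


category = ["A", "ER", "ING", "AND", "ON"]
test = ["AR", "AND", "ER", "ED", "ON"]
print(out_of_place(category, test, max_value=10))  # 3 + 2 * 10 = 23
```

AR and ED are absent from the category profile, AND moves two places, ER moves one, and ON stays in place, which gives the 3 + 2 * max_value stated on the slide.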

Extensions to N-Grams Method
- Lowest granularity: आदित्य = अ + ा + ि + द + त + य
- Letter granularity: आदित्य = आ + दि + त + य
- Conjunct granularity: आदित्य = आ + दि + त्य
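The three granularities can be approximated on Unicode text as shown below, using the word आदित्य as an illustrative input. This is a sketch under assumptions: the system described here works on ISCII and font glyphs, whereas the code splits Unicode code points, so the lowest-granularity units differ in detail (Unicode stores आ as one code point and the half consonant as त plus an explicit virama). The function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign that joins consonants into a conjunct


def code_points(word):
    """Lowest granularity: every Unicode code point is its own unit."""
    return list(word)


def letters(word):
    """Letter granularity: a base character plus the marks attached to it."""
    units = []
    for ch in word:
        # Vowel signs and the virama have a Unicode category starting with "M".
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def conjuncts(word):
    """Conjunct granularity: letters linked by a virama are merged into one unit."""
    units = []
    for letter in letters(word):
        if units and units[-1].endswith(VIRAMA):
            units[-1] += letter
        else:
            units.append(letter)
    return units


word = "\u0906\u0926\u093f\u0924\u094d\u092f"  # आदित्य
print(code_points(word))
print(letters(word))
print(conjuncts(word))
```

N-grams are then built over these units instead of over raw characters, so a tri-gram at conjunct granularity spans three full syllabic units.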

Experimental Training Setup

Language   Total size of pages (KB)   No. of pages   Average page size (KB)
Marathi    700                        46             15.2
Hindi      600                        24             25
Sanskrit   560                        19             29.5

Category Profiles Generated through Training

Language   Handpicked common words   N-grams (atomic)   N-grams (letter)   N-grams (conjunct)
Marathi    25                        37633              63596              63580
Hindi      [missing]                 15450              26886              26865
Sanskrit   21                        24119              45380              49368

Classification Results
Columns: Language, Common Words, Atomic Approach, Letter Approach, Conjunct Approach
- Marathi: 91%, 95%, 100%
- Hindi: 93%, 80%, 92%
- Sanskrit: 86%, 50%
[Some cells are missing, so the values are listed in their original order and are not aligned to columns.]

Summary and Future Work
- Good results have been obtained through syntactic classification.
- The common words technique is the most computationally efficient, but it is less accurate.
- Our extensions to the N-grams method give the desired accuracy.
- The N-grams technique is robust to syntax errors.
- The N-grams technique does not require linguistic knowledge.
- Future work: use language identification to select a good starting set of pages for crawling for the general-purpose search engine.