Language Identification in Web Pages. Bruno Martins, Mário J. Silva. Faculdade de Ciências da Universidade de Lisboa. ACM SAC 2005, Document Engineering Track.

Presentation transcript:

Language Identification in Web Pages Bruno Martins, Mário J. Silva Faculdade de Ciências da Universidade de Lisboa ACM SAC 2005 DOCUMENT ENGINEERING TRACK (DE-ACM-SAC-2005)

Motivation ● Goal: Efficiently crawl web pages in a given language, Portuguese in our case. ● This requires accurately distinguishing one language from others. We take an n-gram based approach to this problem, which has been reported to give excellent results.

Problems ● Web texts are considerably different from conventional text collections: – Multilingual documents. – Spelling errors. – Lack of coherent sentences. – Often small amounts of textual data. These differences motivate revisiting the problem.

Outline ● Introduction. ● Context and Related Work. – Language identification. – Text categorization with n-grams. ● Our Language Identification Algorithm. ● Experimental Results. ● Future Work. ● Conclusions.

Language Identification ● Sibun and Reynar provided a good survey. ● A variety of features have been tried: – Characters, words, POS tags, n-grams,... ● N-gram based methods seem to be the most promising: – Dunning, Damashek, Cavnar & Trenkle,...

N-grams in text categorization N-grams = n-character slices of a longer string. ● “tumba!” is composed of the following n-grams: – Unigrams: _, t, u, m, b, a, !, _ – Bigrams: _t, tu, um, mb, ba, a!, !_ – Trigrams: _tu, tum, umb, mba, ba!, a!_, !__ – Quadgrams: _tum, tumb, umba, mba!, ba!_, a!__, !___ – Quintgrams: _tumb, tumba, umba!, mba!_, ba!__, a!___, !____ ● Advantages: – Efficiently handle spelling and grammatical errors. – No need for tokenization, stemming,... – Computationally and space efficient.
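The decomposition on the slide above can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function name `char_ngrams` is hypothetical, and the padding rule (one leading underscore, `n-1` trailing underscores, at least one) is inferred from the "tumba!" example rather than stated on the slide.

```python
def char_ngrams(token, n):
    """Return the character n-grams of a token, padded with underscores.

    Padding rule (an assumption inferred from the slide's example):
    one leading "_" and max(n - 1, 1) trailing "_" characters.
    """
    padded = "_" + token + "_" * max(n - 1, 1)
    # Slide a window of width n over the padded token.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, char_ngrams("tumba!", n))
```

Because the n-grams are plain substrings, a misspelled word still shares most of its n-grams with the correct form, which is why the approach tolerates spelling errors without tokenization or stemming.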

Outline ● Introduction. ● Context and Related Work. ● Our Language Identification Algorithm. – N-gram categorization approach. – Measuring similarity with n-gram profiles. – Heuristics for Web documents. ● Experimental Results. ● Future Work. ● Conclusions.

N-gram categorization approach ● Measure similarity among documents through n-gram statistics. ● N-grams of multiple lengths simultaneously (1-5)
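One way to collect n-gram statistics for lengths 1 to 5 at once is to count every n-gram of every whitespace-separated token and keep the most frequent ones as a ranked profile. The sketch below is illustrative only: the name `ngram_profile`, the lowercasing, the whitespace split, and the `top_k=400` cut-off are assumptions, not details taken from the slides.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=400):
    """Build a ranked n-gram profile (most frequent first) for a text.

    Assumptions: tokens are split on whitespace, lowercased, and padded
    with one leading "_" and max(n - 1, 1) trailing "_" characters.
    """
    counts = Counter()
    for token in text.lower().split():
        for n in range(1, max_n + 1):
            padded = "_" + token + "_" * max(n - 1, 1)
            counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Keep only the top_k most frequent n-grams, in rank order.
    return [gram for gram, _ in counts.most_common(top_k)]


if __name__ == "__main__":
    print(ngram_profile("o tumba é um motor de busca")[:10])
```

A profile built this way from a large training corpus stands for a language; the same function applied to a single page gives the document profile to compare against it.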

N-gram similarity - Cavnar & Trenkle
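The figure for this slide did not survive in the transcript. Cavnar & Trenkle's method is usually described as an "out-of-place" rank distance between two ranked n-gram profiles, and the sketch below shows that measure. The function names and the choice of the language profile's length as the penalty for unseen n-grams are assumptions made for illustration.

```python
def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between a document and a language profile.

    Both profiles are lists of n-grams ordered from most to least frequent.
    N-grams missing from the language profile get a fixed maximum penalty
    (here the profile length, an assumed choice).
    """
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    distance = 0
    for rank, gram in enumerate(doc_profile):
        if gram in lang_rank:
            distance += abs(rank - lang_rank[gram])
        else:
            distance += max_penalty
    return distance


def guess_language(doc_profile, lang_profiles):
    """Return the language whose profile is closest to the document."""
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


if __name__ == "__main__":
    profiles = {"pt": ["de", "_d", "os"], "en": ["th", "he", "_t"]}
    print(guess_language(["de", "os", "_d"], profiles))
```

The language with the smallest distance wins, so identical rankings score 0 and every swapped or missing n-gram adds to the total.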

More efficient similarity measures ● Lin's information-theoretic similarity measure. ● Jiang and Conrath's distance formula.
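The formulas on this slide were lost in the transcript. As a rough guide, Lin's measure is commonly written as twice the information in what two objects share, divided by the information in each object's full description; Jiang and Conrath's distance uses the same quantities as a difference instead of a ratio. The sketch below applies that general form to sets of n-grams. Treating n-grams as independent features, and the `prob` dictionary of n-gram probabilities, are assumptions; the exact formulation in the slides may differ.

```python
import math


def _log_prob_sum(grams, prob):
    """Sum of log-probabilities for a set of n-grams."""
    return sum(math.log(prob[g]) for g in grams)


def lin_similarity(a, b, prob):
    """Lin-style similarity: 2*log P(common) / (log P(a) + log P(b)).

    `prob` maps each n-gram to an assumed probability in (0, 1].
    Returns a value in [0, 1]; 1 means the two sets are identical.
    """
    a, b = set(a), set(b)
    total = _log_prob_sum(a, prob) + _log_prob_sum(b, prob)
    if total == 0:
        return 0.0  # no information to compare
    return 2.0 * _log_prob_sum(a & b, prob) / total


def jiang_conrath_distance(a, b, prob):
    """Jiang-Conrath-style distance: 2*log P(common) - (log P(a) + log P(b)).

    Non-negative; 0 means the two sets are identical.
    """
    a, b = set(a), set(b)
    total = _log_prob_sum(a, prob) + _log_prob_sum(b, prob)
    return 2.0 * _log_prob_sum(a & b, prob) - total


if __name__ == "__main__":
    prob = {"_t": 0.5, "tu": 0.25, "um": 0.25}
    print(lin_similarity({"_t", "tu"}, {"_t", "um"}, prob))
    print(jiang_conrath_distance({"_t", "tu"}, {"_t", "um"}, prob))
```

Rare n-grams carry more information than frequent ones, so sharing them raises the similarity more than sharing common ones does.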

Heuristics for the Web ● Use meta-data information, if available and valid. – Matching strings on the language meta tag. ● Filter common or automatically generated strings. – “optimized for Internet Explorer” ● Weight n-grams according to HTML markup. – Title, bold typeface, subject and description meta tags. ● Handle insufficient data. – Ignore pages with fewer than 40 characters. ● Handle multilingualism and hard-to-decide cases. – Weight the largest sentences.
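Two of the heuristics above, filtering automatically generated strings and skipping pages with too little text, can be sketched as a small preprocessing step. This is an illustration under stated assumptions: the name `prepare_text`, the one-entry phrase list, and applying the 40-character threshold after filtering are all choices made here, not details confirmed by the slides.

```python
# Example phrase taken from the slide; a real list would be longer (assumption).
BOILERPLATE_PHRASES = ("optimized for internet explorer",)

# Threshold from the slide: pages with fewer than 40 characters are ignored.
MIN_CHARS = 40


def prepare_text(text):
    """Strip known boilerplate phrases and reject texts that are too short.

    Returns the cleaned, lowercased text, or None when fewer than
    MIN_CHARS characters remain (the page should then be skipped).
    """
    cleaned = text.lower()
    for phrase in BOILERPLATE_PHRASES:
        cleaned = cleaned.replace(phrase, " ")
    cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
    return cleaned if len(cleaned) >= MIN_CHARS else None


if __name__ == "__main__":
    print(prepare_text("Optimized for Internet Explorer"))
    print(prepare_text("Esta página contém texto suficiente para ser classificada."))
```

Removing such strings first matters because an English boilerplate phrase on a short Portuguese page could otherwise dominate the n-gram profile.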

Outline ● Introduction. ● Context and Related Work. ● Our Language Identification Algorithm. ● Experimental Results. ● Future Work. ● Conclusions.

Evaluation Experiments ● Language profiles for 23 different languages. ● Test collection: 500 documents for each of 12 different languages. ● HTML documents crawled from portals and online newspapers. ● Tested the classification algorithm in different settings. ● Lin's measure was the most accurate. ● Heuristics improve performance.

Evaluation Results

Application to the Portuguese Web ● About 3.5 million pages. ● Multiple file types. ● A significant portion of the Portuguese Web is written in foreign languages, especially English.

Limitations ● Unable to distinguish dialects of the same language? – Portuguese from Portugal and from Brazil. – English and American English? ● Possible directions: – Web linkage information. – “Discriminative” n-grams instead of most frequent.

Future Work ● Carefully choose better training data. ● Smoothing (Good-Turing). ● Use the n-gram approach for other classification tasks.

Conclusions ● N-grams are effective in language guessing. ● Text from the Web presents problems. ● Lin's similarity measure seems effective.

Thanks for your attention!