Language Identification: Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods
Ben King, June 12, 2013

Presentation transcript:

Slide 1: Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods
Ben King and Steven Abney, University of Michigan

Slide 2: Language identification background
Language identification is one of the older problems in NLP
  – Especially with regard to spoken language
Performance on this task tends to be quite high (>99% accuracy)
Most previous formulations assume monolingual documents

Slide 3: Problem Background
We were trying to replicate An Crúbadán (Scannell, 2007)
  – Crawls the web to build corpora for minority languages
  – Problem: most pages retrieved have multiple languages mixed together

Slide 4: Problem Definition
Input:
  – Plain text documents with multiple languages mixed
  – The names of the two languages present

Slide 5: Problem Definition
Output:
  – A language tag for every word in the document

Slide 6: Problem Definition
Training data:
  – Small monolingual samples of 643 languages
  – Approximately 1700 words on average

Slide 7: Problem Definition
Q: What makes this problem interesting?
A: Its weakly supervised nature
  – The training data and the testing data are of different types
  – Many properties do not generalize across documents

Slide 8: Contribution of this work
In 2006, Hughes et al. published a survey of language identification and suggested 11 areas of future work
This project covers three:
  – Supporting minority languages
  – Sparse training data
  – Multilingual documents

Slide 9: Test corpus creation
Following An Crúbadán, we build a test corpus of mixed-language documents from the Web
Using the BootCaT tool (Baroni and Bernardini, 2004), we search the web for foreign words
Example:
  – Find documents with: Sotho
  – Search the web for: “tsa”, “ohle”, “ya”, “ke”
Automatically and manually filter the result set

Slide 10: Test corpus creation
Our test corpus contains
  – Over 250K words
  – 30 non-English languages
Corpus is available for download at mixed-language-annotations-release-v1.0.tgz

Slide 11: Test corpus creation
Table columns: Language, # of words (the word counts are missing from this transcript)
Languages: Azerbaijani, Banjar, Basque, Cebuano, Chippewa, Cornish, Croatian, Czech, Faroese, Fulfulde, Hausa, Hungarian, Igbo, Kiribati, Kurdish, Lingala, Lombard, Malagasy, Nahuatl, Ojibwa, Oromo, Pular, Serbian, Slovak, Somali, Sotho, Tswana, Uzbek, Yoruba, Zulu

Slide 12: Test corpus annotation
Each document was manually annotated according to language

Slide 13: Approach
We found many possible reasons why a webpage might contain multiple languages
  – Code-switching
  – Multiple authors who speak different languages
  – An English platform for non-English blogs
Our machine learning approach doesn’t assume any specific process

Slide 14: Features
Character n-grams
Full word
Non-word characters between words
Example word: horse
  – Unigrams: “h”, “o”, “r”, “s”, “e”
  – Bigrams: “_h”, “ho”, “or”, “rs”, “se”, “e_”
  – Trigrams: “_ho”, “hor”, “ors”, “rse”, “se_”
  – 4-grams: “_hor”, “hors”, “orse”, “rse_”
  – 5-grams: “_hors”, “horse”, “orse_”
  – Full word: “horse”
Example context for “horse”: the horse, ‘94 bred
  – Before: “space_present”
  – After: “comma_present”, “space_present”, “apostrophe_present”, “9_present”, “4_present”
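The feature types on the Features slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' extraction code. It assumes that words are maximal runs of letters, that n-grams of length 2 or more are taken over the word padded with “_” on both sides (which reproduces the “horse” lists on the slide), and that every distinct character in the gap next to a word yields one “<name>_present” feature. The function names and the `SYMBOL_NAMES` table are hypothetical.

```python
import re

# Hypothetical mapping from gap characters to feature names; any character
# not listed here is used verbatim (for example "9" gives "9_present").
SYMBOL_NAMES = {" ": "space", ",": "comma", "'": "apostrophe",
                "\u2018": "apostrophe", "\u2019": "apostrophe"}


def char_ngrams(word, n):
    """Character n-grams of a word; for n >= 2 the word is padded with '_'."""
    padded = word if n == 1 else f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def gap_features(gap):
    """One '<name>_present' feature per distinct character, in order of appearance."""
    feats = []
    for ch in gap:
        feat = SYMBOL_NAMES.get(ch, ch) + "_present"
        if feat not in feats:
            feats.append(feat)
    return feats


def extract_features(text):
    """Return one feature dictionary per word in the text."""
    # Splitting on a capturing group yields gap, word, gap, word, ..., gap.
    parts = re.split(r"([^\W\d_]+)", text)
    features = []
    for i in range(1, len(parts), 2):
        word = parts[i]
        features.append({
            "word": word,
            "ngrams": {n: char_ngrams(word, n) for n in range(1, 6)},
            "before": gap_features(parts[i - 1]),
            "after": gap_features(parts[i + 1]),
        })
    return features
```

Calling `extract_features("the horse, '94 bred")` returns three dictionaries, one per word; the entry for “horse” carries the n-gram lists and the before/after features in the same layout as the slide's example.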

Slide 15: Methods – CRF with GE

Slide 16: Methods – CRF with GE
Diagram labels: “tre”; English: 0.75, Sotho: 0.25; Training Data; Testing Data; Eng:Sot = 2:1; English: 83%, Sotho: 17%

Slide 17: Methods – HMM with EM
Hidden Markov Model trained with Expectation Maximization
  – Initialize the emission probabilities using a Naïve Bayes classifier, transition probabilities uniform
  – E-step: label the document with the current HMM
  – M-step: re-estimate the transition and emission probabilities from the labeled document
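The three steps on the HMM slide can be sketched as the loop below. This is a rough illustration under several assumptions that the slides do not state: the E-step is read as hard labeling with Viterbi decoding (a soft forward-backward E-step is equally plausible), emissions are a Naïve Bayes-style product over padded character bigrams with add-alpha smoothing, and the M-step re-estimates both tables from the labeled document alone. All function names, the `alpha` smoothing constant, and the `iterations` cap are hypothetical.

```python
import math
from collections import Counter


def bigrams(word):
    padded = f"_{word}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def fit_emissions(words_by_lang, alpha=1.0):
    """Smoothed per-language bigram log-probabilities (Naive Bayes style)."""
    counts = {lang: Counter(g for w in words for g in bigrams(w))
              for lang, words in words_by_lang.items()}
    vocab = set().union(*counts.values())
    size = len(vocab) + 1  # one extra slot for unseen bigrams
    model = {}
    for lang, c in counts.items():
        total = sum(c.values()) + alpha * size
        table = {g: math.log((c[g] + alpha) / total) for g in vocab}
        model[lang] = (table, math.log(alpha / total))
    return model


def emit_logp(model, lang, word):
    table, unseen = model[lang]
    return sum(table.get(g, unseen) for g in bigrams(word))


def viterbi(doc, langs, model, log_trans):
    """E-step (hard): most likely language sequence under the current HMM."""
    score = {l: emit_logp(model, l, doc[0]) for l in langs}
    back = []
    for word in doc[1:]:
        prev, score, ptr = score, {}, {}
        for l in langs:
            best = max(langs, key=lambda p: prev[p] + log_trans[p][l])
            score[l] = prev[best] + log_trans[best][l] + emit_logp(model, l, word)
            ptr[l] = best
        back.append(ptr)
    path = [max(langs, key=score.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def reestimate(doc, labels, langs, alpha=1.0):
    """M-step: re-estimate emissions and transitions from the labeled document."""
    trans = {a: Counter() for a in langs}
    for a, b in zip(labels, labels[1:]):
        trans[a][b] += 1
    log_trans = {a: {b: math.log((trans[a][b] + alpha) /
                                 (sum(trans[a].values()) + alpha * len(langs)))
                     for b in langs} for a in langs}
    by_lang = {l: [w for w, y in zip(doc, labels) if y == l] for l in langs}
    return fit_emissions(by_lang, alpha), log_trans


def label_document(doc, samples, iterations=5):
    """Label each word of doc with a language from samples (lang -> word list)."""
    if not doc:
        return []
    langs = sorted(samples)
    model = fit_emissions(samples)  # Naive Bayes initialization
    log_trans = {a: {b: math.log(1 / len(langs)) for b in langs} for a in langs}
    labels = viterbi(doc, langs, model, log_trans)
    for _ in range(iterations):
        model, log_trans = reestimate(doc, labels, langs)
        new_labels = viterbi(doc, langs, model, log_trans)
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

Here `samples` would hold the small monolingual word lists for the two named languages and `doc` the tokenized mixed-language document. Re-estimating from the document alone keeps the sketch close to the slide's wording, but it can drift toward a single language on very short documents; interpolating with the initial Naïve Bayes counts is one possible remedy, again an assumption rather than something the slides describe.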

Slide 18: Methods
Baselines:
  – Logistic Regression trained with Generalized Expectation
  – Naïve Bayes classifier

Slide 19: Results

Slide 20: Discussion
CRF with GE is consistently accurate across different amounts of training data
  – But its learning curve looks kind of strange
  – There is some evidence that the CRF is being over-constrained

Slide 21: Discussion
As the size of the training data grows, the number of unique features grows
  – But all constraints in GE are equally important
With pruning we may be able to get even better performance from the CRF
Example constraints:
  – “tre”: occurs 132 times; English: 85%, Sotho: 15%
  – “kga”: occurs 1 time; English: 0%, Sotho: 100%. May not generalize well!
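One simple way to read the pruning idea is a frequency cutoff on the features that produce constraints. The sketch below is only an illustration of that reading: the slides do not say which pruning criterion was considered, so the `min_count` threshold and the function name are assumptions, and the counts in the usage note are made-up values chosen to roughly match the percentages on the slide.

```python
def prune_constraints(feature_label_counts, min_count=5):
    """Keep label distributions only for features seen at least min_count times.

    feature_label_counts maps a feature (for example a character n-gram) to a
    dictionary of per-language occurrence counts from the training samples.
    """
    constraints = {}
    for feature, counts in feature_label_counts.items():
        total = sum(counts.values())
        if total >= min_count:
            # The surviving constraint is the feature's empirical label distribution.
            constraints[feature] = {lang: c / total for lang, c in counts.items()}
    return constraints
```

With hypothetical counts such as `{"tre": {"english": 112, "sotho": 20}, "kga": {"sotho": 1}}`, the constraint for “tre” is kept while the one for “kga”, which rests on a single occurrence, is dropped.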

Slide 22: Future Work
We would like to not have to rely on user-provided labels
  – We are working on a system that can analyze an unknown document and identify the set of languages present
  – That system could be the first stage of a pipeline that includes this work

Slide 23: Questions?