Normalizing Microtext Zhenzhen Xue, Dawei Yin and Brian D. Davison Lehigh University

Introduction
- Microtext
  - Informal written text
  - Limited length
  - E.g., tweets, texting messages, online chat messages
  - High proportion of Non-Standard Words (NSWs): misspellings, slang, abbreviations, etc.
- Difficult for machines (and sometimes humans!) to understand
- A challenge for NLP algorithms: tasks such as Named Entity Recognition have been reported to have lower performance on microtext than on structured text

Examples
- It is hard to define the boundary between NSWs and standard words
- Widely used acronyms (e.g., PhD, ATM, CD, BMW) are standard words

| Non-Standard Word | Standard Word |
|---|---|
| c u 2nite | see you tonight |
| l8t | late |
| msg | message |
| plz | please |
| enuf | enough |
| gooooood | good |
| wrk | work |
| lol | laugh out loud |
| asap | as soon as possible |

Related Work
- General text normalization: deals with tokens such as numbers, abbreviations, dates, currency amounts, etc.
- Spelling correction: relies on a spelling lexicon to detect misspelled tokens
- SMS normalization, with three main approaches:
  - Spelling correction (Choudhury et al. 2007; Cook and Stevenson 2009)
  - Machine translation (Aw et al. 2006)
  - Automatic speech recognition (Kobus et al. 2008)

Normalization Model (1)
- Assume the noisy channel model
- Find the most probable normalization t* for each observed term t':

  t* = argmax_t P(t | t') = argmax_t P(t' | t) P(t)

Normalization Model (2)
- Multiple channels:
  - Grapheme channel
  - Phoneme channel
  - Context channel
  - Acronym channel
- Channel model P(t' | t, c_k): the probability of term t being transformed into t' through channel c_k

Channel Model – Grapheme Channel
- Models the distortion of spellings
- Channel model components:
  - Consonant skeleton
  - Longest common subsequence ratio (LCSR)
  - Levenshtein distance

Channel Model – Grapheme Channel
- Models the distortion of spellings
- Worked examples of the channel model:
  - Word: plz; candidate: please. LCS ratio: 2/6; CS(word): plz, CS(candidate): pls; distance: 1 substitution; similarity: 1/3
  - Word: plz; candidate: pulls. LCS ratio: 1/5; CS(word): plz, CS(candidate): pls; distance: 1 substitution; similarity: 1/5
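Since the slide's formula itself did not survive the transcript, here is a small Python sketch of a grapheme similarity built from these three ingredients. The exact variant is an assumption, but with the definitions below the worked examples above come out to the same final similarities (1/3 for please, 1/5 for pulls), even though the intermediate LCS/skeleton values differ slightly from the slide's.

```python
def lcs_length(a: str, b: str) -> int:
    """Longest-common-subsequence length via classic dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def levenshtein(a, b) -> int:
    """Standard edit distance (unit-cost insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, 1):
        curr = [i]
        for j, xb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (xa != xb)))
        prev = curr
    return prev[-1]


def consonant_skeleton(word: str) -> str:
    """Drop vowels: 'please' -> 'pls', 'plz' -> 'plz'."""
    return "".join(c for c in word.lower() if c not in "aeiou")


def grapheme_similarity(observed: str, candidate: str) -> float:
    """LCS ratio divided by the skeleton edit distance (floored at 1)."""
    lcsr = lcs_length(observed.lower(), candidate.lower()) / max(len(observed), len(candidate))
    dist = max(1, levenshtein(consonant_skeleton(observed), consonant_skeleton(candidate)))
    return lcsr / dist


print(grapheme_similarity("plz", "please"))  # 0.333..., i.e. 1/3 as on the slide
print(grapheme_similarity("plz", "pulls"))   # 0.2, i.e. 1/5 as on the slide
```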

Channel Model – Phoneme Channel
- Models the distortion of pronunciations
- Channel model based on letter-to-phoneme conversion*

*Rama, T.; Singh, A. K.; and Kolachina, S. Modeling letter-to-phoneme conversion as a phrase-based statistical machine translation problem with minimum error rate training. In NAACL '09.
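A parallel sketch for the phoneme channel. The paper's channel is built on a trained letter-to-phoneme converter; the tiny pronunciation table and the normalized-distance similarity below are illustrative assumptions, not the paper's model.

```python
def edit_distance(a, b) -> int:
    """Unit-cost edit distance over any two sequences (here: phoneme lists)."""
    prev = list(range(len(b) + 1))
    for i, xa in enumerate(a, 1):
        curr = [i]
        for j, xb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (xa != xb)))
        prev = curr
    return prev[-1]


# Hypothetical ARPAbet-style pronunciations; a real system would use a trained
# letter-to-phoneme model such as the one the slide cites (Rama et al., NAACL '09).
DEMO_PRONUNCIATIONS = {
    "2nite":   ["T", "UW", "N", "AY", "T"],   # naive L2P reading of "2" as "two"
    "tonight": ["T", "AH", "N", "AY", "T"],
    "l8t":     ["L", "EY", "T"],
    "late":    ["L", "EY", "T"],
}


def phoneme_similarity(observed: str, candidate: str) -> float:
    """1 minus the length-normalized edit distance between phoneme sequences."""
    p1 = DEMO_PRONUNCIATIONS[observed.lower()]
    p2 = DEMO_PRONUNCIATIONS[candidate.lower()]
    return 1.0 - edit_distance(p1, p2) / max(len(p1), len(p2))


print(phoneme_similarity("l8t", "late"))       # 1.0: identical pronunciations
print(phoneme_similarity("2nite", "tonight"))  # 0.8: one vowel substitution
```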

Channel Model – Context Channel
- Incorporates context information
- Context probability is based on language models (e.g., the Microsoft Web N-gram Service)

| Observed Word | Candidates | Contexts & Normalizations |
|---|---|---|
| yr | your, year | It's yr (your) turn / Happy new yr (year) |
| wll | will, well, wall | I wll (will) leave |
| bin | bin, been | Have bin (been) really happy since… |
| no | no, know | Don't no (know) why |
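A toy stand-in for the context channel. The slide points to a web-scale language model (Microsoft Web N-gram Service); the tiny add-alpha-smoothed bigram model below only illustrates how context separates candidates such as no vs. know.

```python
from collections import Counter

# Minimal corpus standing in for web-scale n-gram counts (illustrative only).
corpus = "do n't know why it 's your turn happy new year i will leave".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))


def bigram_prob(prev: str, word: str, alpha: float = 0.1) -> float:
    """Add-alpha smoothed P(word | prev)."""
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)


def context_score(left: str, candidate: str, right: str) -> float:
    """How well the candidate bridges its left and right neighbors."""
    return bigram_prob(left, candidate) * bigram_prob(candidate, right)


# "Don't no why": 'know' should bridge "n't ... why" better than 'no'
print(context_score("n't", "know", "why"))  # noticeably higher
print(context_score("n't", "no", "why"))    # near zero
```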

Channel Model – Acronym Channel
- Word-to-phrase (one-to-many) normalization, e.g., ASAP, FYI, LOL
- Channel model: a set of popular acronyms collected from ref/textmessageabbreviations.asp
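The acronym channel is essentially a dictionary lookup against the collected acronym list; a minimal sketch with a few sample entries (not the paper's full list):

```python
# Sample acronym table; the paper collects its list from a public reference site.
ACRONYMS = {
    "asap": "as soon as possible",
    "fyi":  "for your information",
    "lol":  "laugh out loud",
    "brb":  "be right back",
}


def acronym_expansion(token: str):
    """One-to-many normalization: return the phrase for a known acronym, else None."""
    return ACRONYMS.get(token.lower())


print(acronym_expansion("ASAP"))  # 'as soon as possible'
```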

Channel Probabilities – Generic
- Term-independent channel probability P(c_k), shared across all terms

Channel Probabilities – Term Dependent
- Learn term-dependent channel probabilities from training data by optimizing an objective function (the formula did not survive the transcript)
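Purely as an illustration of the distinction between generic and term-dependent probabilities (the slide's objective function was lost in transcription, so this is not the paper's estimator), one simple way to get term-independent channel priors is to count which channel best explains each labeled training pair:

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# (observed, candidate) -> P(observed | candidate, channel); names are illustrative.
ChannelFn = Callable[[str, str], float]


def estimate_channel_priors(pairs: List[Tuple[str, str]],
                            channels: Dict[str, ChannelFn]) -> Dict[str, float]:
    """Approximate P(c_k) by assigning each (NSW, gold) pair to its best channel.

    Illustrative counting scheme only; assumes `pairs` is non-empty.
    """
    counts = Counter(max(channels, key=lambda k: channels[k](obs, gold))
                     for obs, gold in pairs)
    total = sum(counts.values())
    return {k: counts[k] / total for k in channels}
```

A term-dependent variant would condition these probabilities on the observed term (or features of it) rather than pooling all pairs into one count table.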

Combining Channels
- Objective function combining all channels (formula not preserved in the transcript)
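Putting the channels together. The combined objective on this slide was lost in transcription; the prior-weighted mixture below, a sum over channels of P(c_k) P(t' | t, c_k) times the language-model prior P(t), is one natural reading of the multi-channel noisy-channel setup, not necessarily the paper's exact formula.

```python
def normalize_token(observed: str, candidates, channels, channel_prior, lm_prob):
    """t* = argmax_t  P(t) * sum_k P(c_k) * P(t' | t, c_k)  over the candidate set.

    `channels` maps channel name -> likelihood function(observed, candidate),
    `channel_prior` maps channel name -> P(c_k), and `lm_prob` gives P(t).
    """
    def score(cand: str) -> float:
        likelihood = sum(channel_prior[k] * f(observed, cand)
                         for k, f in channels.items())
        return lm_prob(cand) * likelihood

    return max(candidates, key=score)
```

With term-dependent probabilities (the MC-TD variant in the results), `channel_prior` would additionally depend on the observed term.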

Experiments
- Two datasets:
  - 818 tweets collected in September
    - 254 (31%) of the tweets need normalization
    - A total of 479 word transformations
  - A public SMS dataset
    - A training set of 854 text messages and a test set of 138 messages
    - 127 (92%) of the test messages need normalization
    - An average of 4 word transformations per message
- Comparison algorithms:
  - Aspell
  - Moses (a competitive machine translation toolkit)
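The slides report accuracy, F-measure, precision, and recall. As a small sketch of how word-level precision/recall over proposed transformations can be computed (whether this matches the paper's exact protocol is an assumption):

```python
def transformation_metrics(predicted: dict, gold: dict):
    """Precision/recall/F1 over word transformations.

    `predicted` and `gold` map token positions to normalized forms;
    positions left unchanged are simply absent from the maps.
    """
    proposed, required = set(predicted.items()), set(gold.items())
    correct = len(proposed & required)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(required) if required else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Gold says positions 2 and 5 need fixing; the system fixed 2 correctly and 7 spuriously.
print(transformation_metrics({2: "please", 7: "work"}, {2: "please", 5: "tonight"}))
# (0.5, 0.5, 0.5)
```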

Results

[Table: accuracy, F-measure, precision, and recall for Default, Aspell, Moses, and MC-TD* on the Twitter dataset, and for Default, Aspell, Choudhury07, Moses, and MC-TD on the SMS dataset. Most cell values were lost in transcription; the surviving ones: Default accuracy is 0.70 on Twitter and 0.60 on SMS (F-measure N/A), and Choudhury07 reaches 0.86 accuracy on SMS.]

*MC-TD: our Multi-Channel model with Term-Dependent probabilities

Importance of the Three Channels

[Chart not preserved in the transcript.]

Conclusions and Future Work
- The multi-channel model outperforms baseline algorithms on two typical datasets
- All four factors (orthographic, phonetic, and contextual information, plus acronym expansion) are important
- Future work:
  - Optimize the Levenshtein distance operator weights in both the grapheme and phoneme channels
  - Look at dialects: differences in NSWs based on geography or other sub-populations
  - Compare against the similar work published by Han and Baldwin at ACL 2011, on the same data

Thank You! Brian D. Davison