Applicability of N-Grams to Data Classification A review of 3 NLP-related papers Presented by Andrei Missine (CS 825, Fall 2003)


What are N-Grams?
- Sequences of n words or tokens drawn from a corpus.
- Used to estimate the probability that a word W comes next given the preceding 0 to n − 1 words of context (a unigram model uses none, a bigram model uses one, and so on).
- Common n-grams: unigrams, bigrams, trigrams and four-grams.
- One of the simpler statistical models used in NLP.
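A minimal sketch of the idea in Python (the corpus and function name are illustrative): a bigram model estimated by maximum likelihood, giving P(next word | previous word) as relative counts.

```python
from collections import Counter, defaultdict

def bigram_model(tokens):
    """Estimate P(next | previous) from a token list by maximum likelihood."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    model = defaultdict(dict)
    for (prev, nxt), count in pair_counts.items():
        model[prev][nxt] = count / context_counts[prev]
    return model

tokens = "the cat sat on the mat the cat ran".split()
print(bigram_model(tokens)["the"])  # {'cat': 0.666..., 'mat': 0.333...}
```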

N-Grams and Authorship Attribution
Authorship attribution is the task of determining who wrote a given text. The approach suggested by the authors of the first paper (1) is to parse a known document written by author A1 at the byte level and extract its n-grams. The most frequent n-grams are saved as the author profile for A1. This process is repeated for all other authors (A2 – An), giving a collection of author profiles. A new text is then compared against the existing profiles, and the profile with the smallest dissimilarity identifies the most likely author.
(1) “N-Gram-based Author Profiles for Authorship Attribution”
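A sketch of profile building and comparison in Python. The dissimilarity below is the relative-frequency measure described in the paper, to the best of my reading; the profile size, value of n, and function names are illustrative.

```python
from collections import Counter

def profile(data: bytes, n: int = 3, size: int = 1000):
    """The most frequent byte n-grams, mapped to their relative frequencies."""
    grams = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(size)}

def dissimilarity(p1, p2):
    """Sum over the union of n-grams of ((f1 - f2) / ((f1 + f2) / 2)) ** 2."""
    return sum(
        ((p1.get(g, 0.0) - p2.get(g, 0.0)) / ((p1.get(g, 0.0) + p2.get(g, 0.0)) / 2)) ** 2
        for g in set(p1) | set(p2)
    )

# Attribution: pick the author whose profile is least dissimilar, e.g.
# best = min(profiles, key=lambda a: dissimilarity(profile(text), profiles[a]))
```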

N-Grams on Byte Level?
Instead of treating text as a collection of words, just look at the bytes. No modifications to the algorithm are required when switching between languages.
The good side: the experiment achieved 100% (2) accuracy on English and 97% (2) accuracy on Greek data, much better than any of the previously attempted methods.
The bad side: the approach did worse on Chinese data, reaching 89% (2) accuracy (the previously achieved accuracy was 94%). A likely reason is that many Asian languages use two-byte character encodings, so some byte n-grams may contain only half of a character.
(2) Best achieved accuracy
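A quick illustration of the half-character problem (the encoding here is my choice for illustration; the paper's Chinese data may have used a different two-byte encoding):

```python
text = "中文"                    # two Chinese characters
data = text.encode("utf-16-be")  # two bytes per character: b'\x4e\x2d\x65\x87'

# Byte bigrams: the middle one straddles a character boundary, pairing the
# second byte of one character with the first byte of the next.
bigrams = [data[i:i + 2] for i in range(len(data) - 1)]
print(bigrams)  # [b'N-', b'-e', b'e\x87'] (raw bytes, shown as ASCII where possible)
```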

N-Grams and Sentiment Classification
In the second paper (3) the authors discuss how n-grams and machine learning can be applied to classifying movie reviews as positive or negative. The main reasons movie reviews were chosen are their wide availability, the ease of programmatically determining whether a review is positive or negative (e.g. by the number of stars), and the large number of different reviewers.
Some preliminary results: the chance of guessing the classification is 50%. When two computer science graduate students were asked to provide lists of positive and negative words, the resulting word lists classified reviews with 58% and 64% accuracy. When a statistical method was applied to produce such a list, the accuracy was 69%.
(3) “Thumbs up? Sentiment Classification using Machine Learning Techniques”

N-Grams and Sentiment Classification (continued)
So how well did machine learning do?
- Naïve Bayes classification performs best at 81.5% when unigrams and parts of speech (4) are used.
- Maximum entropy classification has a slightly lower best performance of 81.0% when the top 2,633 unigrams are chosen.
- Support vector machines have the best overall performance of the three, with a high of 82.9% achieved when unigrams were used.
Notes: the data was a corpus collected from IMDb. Interestingly, the presence of the n-grams appears to matter more than their frequency in this application.
(4) As mentioned by the authors, a “crude form of sense disambiguation”
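A minimal sketch of this setup with scikit-learn (not the authors' code; the reviews here are a toy stand-in for the IMDb corpus). Setting binary=True records only unigram presence rather than frequency, mirroring the finding above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny illustrative reviews; the paper used a corpus collected from IMDb.
reviews = ["a wonderful, moving film", "dull plot and terrible acting",
           "great performances throughout", "a boring waste of time"]
labels = ["pos", "neg", "pos", "neg"]

# binary=True keeps only the presence of each unigram, not its count.
classifier = make_pipeline(CountVectorizer(binary=True), LinearSVC())
classifier.fit(reviews, labels)
print(classifier.predict(["a terrible, boring film"]))  # expected: ['neg']
```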

N-Grams and Sentiment Classification (continued) - Problems
Why does machine learning not do so well on some reviews? Sometimes considering just the n-grams is not enough; one needs the broader context in which they are used. One example provided by the authors is “thwarted expectations”, where the reviewer spends most of the review describing how great the movie should have been and finishes with a quick comment on how bad it turned out. Such a review contains a large amount of positive language and only a little negative, so it may wrongly receive a positive rating.
The converse also holds: a positive review such as “It was sick, disgusting and disturbing… It was great!” (5) might wrongly receive a negative rating.
(5) Same idea as the “Spice Girls” review in the paper

Affect Sensing on the Sentence Level
The last approach (6) I examined is based on affect sensing: applying well-known real-world facts to a sentence to detect its overall mood. The source of common-sense information was Open Mind Common Sense (OMCS), whose corpus contains roughly 500,000 sentences. Some simple linguistic models were used in conjunction with a smoothing model responsible for determining how mood carries over from one sentence to the next. These were combined to produce a client that attempts to react emotionally (via a simple drawing of a face) to the user's text. The approach used by the authors is different from n-grams.
(6) “A Model of Textual Affect Sensing using Real-World Knowledge”
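The paper's smoothing model is more elaborate than this, but as a rough illustration of carrying mood from one sentence to the next, one could decay the running mood into each new sentence's score (the decay constant and the per-sentence scores below are purely illustrative):

```python
def smoothed_moods(sentence_scores, decay=0.5):
    """Blend each sentence's own affect score with a decayed copy of the
    mood accumulated over the preceding sentences."""
    moods, mood = [], 0.0
    for score in sentence_scores:
        mood = score + decay * mood
        moods.append(mood)
    return moods

# Per-sentence affect scores in [-1, 1] from some sentence-level model:
print(smoothed_moods([0.8, 0.1, -0.9]))  # [0.8, 0.5, -0.65]
```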

Affect Sensing versus N-Grams
- Can be used to provide the user with a friendlier and more natural interface.
- The structure proposed by the authors can handle negations and slightly trickier linguistic constructions than most simple n-gram based approaches.
- Can use common sense to infer more information than n-grams.
- Comes at the price of much more complicated algorithms and a dependency on language-specific resources such as OMCS.
- Affect sensing is very young and has not been evaluated thoroughly, whereas n-grams have been around for some time and are well studied.
Final note: neither can handle sarcasm: “Yeah, right”.

References
- “N-Gram-based Author Profiles for Authorship Attribution” by Vlado Keselj, Fuchun Peng, Nick Cercone and Calvin Thomas. In Proceedings of the Conference of the Pacific Association for Computational Linguistics (PACLING'03), Dalhousie University, Halifax, Nova Scotia, Canada, August 2003.
- “Thumbs up? Sentiment Classification using Machine Learning Techniques” by Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002).
- “A Model of Textual Affect Sensing using Real-World Knowledge” by Hugo Liu, Henry Lieberman and Ted Selker. International Conference on Intelligent User Interfaces (IUI 2003), Miami, Florida, 2003.
- “Foundations of Statistical Natural Language Processing” by Christopher D. Manning and Hinrich Schütze.