Applicability of N-Grams to Data Classification
A review of 3 NLP-related papers
Presented by Andrei Missine (CS 825, Fall 2003)
What are N-Grams?

Sequences of n consecutive words or tokens from a corpus.
Used to estimate the probability of a word W being the next word, given the previous 0 to (n - 1) words.
Common n-grams: unigrams (n = 1), bigrams (n = 2), trigrams (n = 3) and four-grams (n = 4).
One of the simpler statistical models used in NLP.
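The definition above can be made concrete with a minimal Python sketch. It extracts n-grams from a token list and estimates the next-word probability by plain relative frequency, with no smoothing. The function names and the toy corpus are illustrative assumptions, not taken from the reviewed papers.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def next_word_probability(tokens, history, word):
    """Estimate P(word | history) by relative frequency (no smoothing).

    `history` holds the previous (n - 1) words; an empty history gives
    the unigram estimate.
    """
    history = tuple(history)
    n = len(history) + 1
    counts = Counter(ngrams(tokens, n))
    # How often the history is followed by any word at all.
    history_count = sum(c for gram, c in counts.items() if gram[:-1] == history)
    if history_count == 0:
        return 0.0
    return counts[history + (word,)] / history_count


# Toy corpus (hypothetical, for illustration only).
corpus = "the cat sat on the mat and the cat slept".split()
print(ngrams(corpus, 2)[:3])
print(next_word_probability(corpus, ["the"], "cat"))  # bigram estimate
print(next_word_probability(corpus, [], "the"))       # unigram estimate
```

In this toy corpus "the" is followed by "cat" in two of its three occurrences, so the bigram estimate is 2/3. A real model would add smoothing so that unseen n-grams do not get probability zero.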
N-Grams and Authorship Attribution

Authorship attribution is the process of determining who wrote a given text.
The approach suggested by the authors of this paper (1) is to parse a known document written by an author A_1 at the byte level and extract its n-grams.
The most frequent n-grams are then saved as the author profile for A_1.
This process is repeated for all other authors (A_2 to A_n), giving a collection of author profiles.
A new text is compared against the existing profiles, and the author whose profile has the smallest dissimilarity is chosen as the most likely author.
(1) “N-Gram-based Author Profiles for Authorship Attribution”
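The profile-and-compare procedure described above can be sketched in a few lines of Python. The slide does not say how dissimilarity is computed, so the measure used here (squared frequency differences, each normalised by the mean of the two frequencies) is an assumption, as are the function names, the n-gram length, the profile size and the toy texts.

```python
from collections import Counter


def byte_ngram_profile(text, n=3, profile_size=50):
    """Map the most frequent byte n-grams of `text` to relative frequencies."""
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.most_common(profile_size)}


def dissimilarity(profile_a, profile_b):
    """Assumed measure: sum of squared, mean-normalised frequency differences."""
    score = 0.0
    for gram in set(profile_a) | set(profile_b):
        fa = profile_a.get(gram, 0.0)
        fb = profile_b.get(gram, 0.0)
        # fa + fb > 0 because gram occurs in at least one profile.
        score += ((fa - fb) / ((fa + fb) / 2)) ** 2
    return score


def attribute(text, profiles, n=3, profile_size=50):
    """Return the author whose profile is least dissimilar to `text`."""
    unknown = byte_ngram_profile(text, n, profile_size)
    return min(profiles, key=lambda author: dissimilarity(profiles[author], unknown))


# Hypothetical training texts, chosen only to make the example easy to check.
profiles = {
    "author_a": byte_ngram_profile("aaaa bbbb aaaa bbbb"),
    "author_b": byte_ngram_profile("xyz xyz xyz xyz"),
}
print(attribute("aaaa bbbb aaaa", profiles))
```

Because the text is handled as raw bytes, nothing in this sketch depends on the language or on word segmentation.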
N-Grams on Byte Level?

Instead of treating text as a collection of words, just look at the bytes.
No modifications to the algorithm are required when switching between languages.
The good side: the experiment achieved 100% (2) accuracy on English data and 97% (2) accuracy on Greek data. This is much better than any of the previously attempted methods.
The bad side: this approach did worse on Chinese data, with 89% (2) accuracy (the previously achieved accuracy is 94%).
A likely reason is that Chinese text uses multi-byte character encodings (e.g. 2 bytes per character), so some byte-level n-grams include only half of a character.
(2) Best achieved accuracy
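The half-character effect is easy to reproduce. The sketch below assumes UTF-16 (little-endian) as the 2-byte encoding, which is one reading of the slide and not necessarily the encoding of the data used in the paper. The helper name is illustrative.

```python
def byte_ngrams_with_offsets(data, n):
    """Return (offset, n-gram) pairs for every run of n consecutive bytes."""
    return [(i, data[i:i + n]) for i in range(len(data) - n + 1)]


text = "中文"                         # two Chinese characters
encoded = text.encode("utf-16-le")    # assumed encoding: 2 bytes per character here
bigrams = byte_ngrams_with_offsets(encoded, 2)

# With 2 bytes per character, a bigram starting at an odd byte offset
# covers the second half of one character and the first half of the next.
straddling = [(i, gram) for i, gram in bigrams if i % 2 != 0]

print(len(encoded), len(bigrams), len(straddling))  # prints: 4 3 1
```

One of the three byte bigrams cuts across two characters, so it carries no whole character at all. In single-byte English text every byte n-gram is a whole-character n-gram.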
N-Grams and Sentiment Classification

In this paper (3) the authors discuss how n-grams and machine learning can be applied to classifying movie reviews as positive or negative.
Movie reviews were chosen because they are widely available, because it is easy to determine programmatically whether a review is positive or negative (e.g. by the number of stars), and because many different reviewers are available.
Some preliminary results: the chance of guessing the classification correctly is 50%.
When two computer science graduate students were asked to provide lists of positive and negative words, classification based on those lists was 58% and 64% accurate.
Finally, when a statistical method was applied to obtain such a list, the accuracy was 69%.
(3) “Thumbs up? Sentiment Classification using Machine Learning Techniques”
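A word-list baseline of the kind described above might look like the following sketch. The two word lists are hypothetical placeholders, not the lists used in the paper, and the tie handling is an assumption.

```python
import re

# Hypothetical word lists, for illustration only.
POSITIVE = {"brilliant", "excellent", "great", "wonderful"}
NEGATIVE = {"awful", "bad", "boring", "terrible"}


def classify_by_word_lists(review, positive=POSITIVE, negative=NEGATIVE):
    """Label a review by comparing counts of listed positive and negative words."""
    words = re.findall(r"[a-z']+", review.lower())
    pos = sum(word in positive for word in words)
    neg = sum(word in negative for word in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "tie"


print(classify_by_word_lists("A brilliant cast and a great script."))
print(classify_by_word_lists("Boring plot, terrible acting, but great music."))
```

The classifier is only as good as its lists, which is consistent with a statistically derived list doing better than hand-picked ones.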
N-Grams and Sentiment Classification (continued)

So how well did machine learning do?
Naïve Bayes classification has a best performance of 81.5%, reached when unigrams and parts of speech (4) are used.
Maximum Entropy classification has a slightly lower best performance of 81.0%, reached when the top 2633 unigrams are used.
Support Vector Machines have the best overall performance of the three, with the highest being 82.9%, achieved when unigrams were used.
Notes: the data came from a corpus collected from IMDb.
Interestingly, the presence of an n-gram appears to be more important than its frequency in this application.
(4) As the authors put it, a “crude form of sense disambiguation”
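To illustrate what "presence rather than frequency" means, here is a small Naive Bayes sketch over unigram presence features. It is a simplified stand-in with add-one smoothing and whitespace tokenisation, both assumptions, and it is trained on four made-up reviews. It is not the setup evaluated in the paper.

```python
import math
from collections import Counter


def presence_features(text):
    """Each unigram counts once per review, however often it occurs."""
    return set(text.lower().split())


def train_naive_bayes(labelled_reviews):
    """Count, per class, how many reviews contain each unigram."""
    doc_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in labelled_reviews:
        features = presence_features(text)
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(features)
        vocabulary |= features
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-probability for the review."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        class_total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        # Words never seen in training are ignored.
        for word in presence_features(text) & vocabulary:
            # Add-one smoothing (an assumption of this sketch).
            score += math.log(
                (word_counts[label][word] + 1) / (class_total + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training reviews, for illustration only.
training = [
    ("a great and moving film", "pos"),
    ("great cast great script", "pos"),
    ("a dull and boring film", "neg"),
    ("boring plot bad acting", "neg"),
]
model = train_naive_bayes(training)
print(classify("great great great film", model))
print(classify("boring film", model))
```

With presence features, "great great great film" contributes exactly the same evidence as "great film"; a frequency-based variant would count "great" three times.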
N-Grams and Sentiment Classification (continued) - Problems

Why does machine learning not do so well on some reviews?
Sometimes considering just the n-grams is not enough: one needs to look at the broader context in which they are used.
One example provided by the authors is “thwarted expectations”, where the reviewer describes at length how great the movie should have been and finishes with a quick comment on how bad it turned out.
In this case there is a large amount of positive information and only a small amount of negative, so the review might wrongly be classified as positive.
The converse also holds: a positive review such as “It was sick, disgusting and disturbing… It was great!” (5) might wrongly be classified as negative.
(5) Same idea as the “Spice Girls” review in the paper
Affect Sensing on the Sentence Level

The last approach (6) I examined is based on affect sensing: applying well-known common-sense facts to a sentence in order to detect its overall mood.
The source of common-sense information was Open Mind Common Sense (OMCS), which has ~500,000 sentences in its corpus.
Some simple linguistic models were used in conjunction with a smoothing model responsible for determining how the mood carries over from one sentence to the next.
These were combined to produce a client which attempts to react emotionally (via a simple drawing of a face) to the user’s text.
The approach used by the authors is different from n-grams.
(6) “A Model of Textual Affect Sensing using Real-World Knowledge”
Affect Sensing versus N-Grams

Affect sensing can be used to provide the user with a friendlier and more natural interface.
The structure proposed by the authors can handle negations and slightly trickier linguistic structures than most simple n-gram based approaches.
It can use common sense to infer more information than n-grams.
This comes at the price of much more complicated algorithms and a dependency on language-specific sources such as OMCS.
Affect sensing is very young and has not been evaluated thoroughly, whereas n-grams have been around for some time and are well studied.
Final note: neither can handle sarcasm: “Yeah, right”.
References

“N-gram-based Author Profiles for Authorship Attribution” by Keselj Vlado, Peng Fuchun, Cercone Nick and Thomas Calvin. In Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING'03, Dalhousie University, Halifax, Nova Scotia, Canada, August
“Thumbs up? Sentiment Classification using Machine Learning Techniques” (2002) by Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP)
“A Model of Textual Affect Sensing using Real-World Knowledge” by Hugo Liu, Henry Lieberman and Ted Selker. International Conference on Intelligent User Interfaces (IUI 2003), Miami, Florida
“Foundations of Statistical Natural Language Processing” by Christopher D. Manning and Hinrich Schutze