Twitter as a Corpus for Sentiment Analysis and Opinion Mining Adam Rosenberg, Leandra Irvine, Gus Logsdon
Twitter: New Huge Microblogging Platform 2010
Twitter… Daily Life Variety of People
Twitter is Useful! Politics Marketing Sentiment Analysis
Contributions A method to collect a corpus with positive and negative sentiments, and a corpus of objective texts such that no human effort is needed for classifying the documents. Performing statistical linguistic analysis of the collected corpus. Using the collected corpora to build a sentiment classification system for microblogging. Conducting experimental evaluations on a set of real microblogging posts to prove that our presented technique is efficient and performs better than previously proposed methods.
Previous Work Pang and Lee, 2008 – survey of field, very little use of microblogging Yang et al., 2007 – analyzed blog corpus using Support Vector Machines and CRF learners Read, 2005 – Usenet group corpus with SVMs and Naïve Bayes Go et al., 2009 – Used Twitter corpus with SVMs and Naïve Bayes. Achieved 81% accuracy
Corpus Collection Three sentiment classes in Twitter data: positive, negative, and objective (neutral). Tweets containing happy emoticons, e.g. ":)" or ":-)", and tweets containing sad emoticons, e.g. ":(" or ":-(", were included in the positive and negative training data, respectively (Read, 2005; Go et al., 2009). Objective posts were retrieved from well-known news sources such as the New York Times and the Washington Post. Assumption: the emoticon summarizes the sentiment of the entire tweet. All tweets collected were in English.
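The emoticon-based labeling can be sketched as follows. This is a minimal illustration, not the presenters' collection code: the emoticon lists and the function name `label_by_emoticon` are assumptions, and tweets carrying both kinds of emoticon are simply discarded here.

```python
from typing import Optional

# Illustrative emoticon lists (assumption: the exact sets are not given on the slide).
HAPPY = (":)", ":-)", "=)", ":D")
SAD = (":(", ":-(", "=(", ";(")


def label_by_emoticon(tweet: str) -> Optional[str]:
    """Return 'positive', 'negative', or None when the emoticons give no clear label."""
    has_happy = any(e in tweet for e in HAPPY)
    has_sad = any(e in tweet for e in SAD)
    if has_happy == has_sad:
        # Neither kind present, or both kinds present: skip the tweet.
        return None
    return "positive" if has_happy else "negative"
```

Because the label comes from the emoticon alone, no human annotation is needed, at the cost of relying on the assumption that the emoticon reflects the whole tweet.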
Corpus Analysis Word frequencies in the corpus follow Zipf’s law. POS tagging done with TreeTagger. POS-tag distributions compared across the 3 main sets (positive, negative, objective).
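Zipf's law says a word's frequency is roughly inversely proportional to its frequency rank, so rank times count stays roughly constant. A small sketch for building the rank-frequency table one would inspect (the function name `rank_frequency` is hypothetical):

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]
```

Plotting count against rank on log-log axes should give an approximately straight line if the corpus follows the law.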
POS-Tagging Pairwise Comparison Objective vs Subjective More frequent in subjective texts: interjections (“ha”, “yay”, “wow”); superlative adjectives (“most”, “least”); 1st and 2nd person simple verbs (“I take”, “you eat”); personal pronouns (“you”, “us”, “me”). More frequent in objective texts: comparative adjectives (“more”, “less”); 3rd person past participle (“he has taken”, “she has eaten”); common and proper nouns (“girl”, “Bob”, “officer”).
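POS-Tagging Pairwise Comparison Positive vs Negative More frequent in positive texts: superlative adverbs (“most”, “best”); possessive ending (“friend’s”); “whose” (a misspelling of “who’s”). More frequent in negative texts: past tense verbs (“missed”, “lost”, “stuck”) and past participles (“taken”, “bored”, “gone”).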
Feature Extraction Used the presence of n-grams as binary features, ignored frequency. Which n-gram model best captures Twitter post sentiments? Filtering: remove URL links, Twitter user names, and emoticons. Tokenization: form a bag of words by splitting text into smaller units. Stopword removal: removed articles such as “the” from bag of words. Constructing n-grams: create set of n-grams from consecutive words. For better accuracy, attach negations such as “not” to the word that they modify. Negations highly influence the sentiment of the expression. (Wilson et al., 2005).
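The pipeline above can be sketched in a few lines. This is an illustration under stated assumptions: the stopword and negation lists, the emoticon pattern, and the function name `extract_ngrams` are hypothetical, and negation handling is simplified to gluing a negation onto the word that follows it.

```python
import re

STOPWORDS = {"a", "an", "the"}              # articles only, as in the slide's example
NEGATIONS = {"not", "no", "never"}          # illustrative list (assumption)
EMOTICON_RE = re.compile(r"[:;=]-?[()D]")   # rough emoticon pattern (assumption)


def extract_ngrams(tweet, n=2):
    """Return the set of n-grams present in a tweet (presence only, no counts)."""
    # Filtering: drop URL links, Twitter user names, and emoticons.
    text = re.sub(r"https?://\S+", " ", tweet)
    text = re.sub(r"@\w+", " ", text)
    text = EMOTICON_RE.sub(" ", text)
    # Tokenization: lowercase, then split into word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Negation attachment (simplified): merge a negation with the next word.
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # n-gram construction from consecutive tokens; a set ignores frequency.
    return {" ".join(merged[j:j + n]) for j in range(len(merged) - n + 1)}
```

Returning a set rather than a list is what makes each n-gram a binary feature: an n-gram is either present in the tweet or not.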
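Classifier Used the Naïve Bayes classifier to determine the sentiment: $P(s \mid M) = \frac{P(s)\,P(M \mid s)}{P(M)}$ Equal number of messages in each sentiment class, so the prior $P(s)$ is a constant and this simplifies to: $P(s \mid M) \sim \frac{P(M \mid s)}{P(M)}$ and, since $P(M)$ does not depend on $s$, $P(s \mid M) \sim P(M \mid s)$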
Classifier (Cont.) Two Bayes classifiers, one based on the presence of n-grams and the other based on the presence of POS-tags in the message. Let G be a set of n-grams representing the message and T be a set of POS-tags of the message. Mathematically: $P(s \mid M) \sim P(G \mid s)\,P(T \mid s)$, where $P(G \mid s) = \prod_{g \in G} P(g \mid s)$
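Classifier (Cont.) $P(T \mid s) = \prod_{t \in T} P(t \mid s)$ Substitution: $P(s \mid M) \sim \prod_{g \in G} P(g \mid s) \cdot \prod_{t \in T} P(t \mid s)$ Log-likelihood: $L(s \mid M) = \sum_{g \in G} \log P(g \mid s) + \sum_{t \in T} \log P(t \mid s)$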
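A minimal sketch of the log-likelihood classifier: each message is a set of n-grams G, a set of POS-tags T, and a sentiment s. The function names and the add-alpha smoothing are assumptions added so that unseen features do not produce log(0); the slides do not say how zero counts are handled.

```python
import math
from collections import Counter, defaultdict


def train(messages):
    """messages: iterable of (ngrams, pos_tags, sentiment) triples.

    Returns (n_msgs, counts): messages per sentiment, and for each sentiment
    the number of messages in which each feature is present.
    """
    n_msgs = Counter()
    counts = defaultdict(Counter)
    for ngrams, tags, s in messages:
        n_msgs[s] += 1
        for g in set(ngrams):
            counts[s][("g", g)] += 1
        for t in set(tags):
            counts[s][("t", t)] += 1
    return n_msgs, counts


def log_likelihood(ngrams, tags, s, n_msgs, counts, alpha=1.0):
    """L(s|M) = sum over g of log P(g|s) + sum over t of log P(t|s)."""
    def log_p(feature):
        # Smoothed presence probability (smoothing is an assumption).
        return math.log((counts[s][feature] + alpha) / (n_msgs[s] + 2 * alpha))

    return (sum(log_p(("g", g)) for g in set(ngrams))
            + sum(log_p(("t", t)) for t in set(tags)))


def classify(ngrams, tags, n_msgs, counts):
    """Pick the sentiment with the highest log-likelihood."""
    return max(n_msgs, key=lambda s: log_likelihood(ngrams, tags, s, n_msgs, counts))
```

Working with sums of logs instead of products of probabilities avoids numeric underflow when a message has many features.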
Higher Accuracy To account for statistical noise in the data (headwords, stopwords, etc.) this method uses a couple of new factors: Entropy - higher Shannon entropy → less able to distinguish between sentiments Salience - higher salience → more biased towards one sentiment or another
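Both factors can be computed per n-gram. The sketch below uses the standard Shannon entropy of the sentiment distribution P(s|g); the salience formula (mean over sentiment pairs of 1 minus the min/max ratio of P(g|s)) is an assumption, since the slide does not state one.

```python
import math
from itertools import combinations


def entropy(p_s_given_g):
    """Shannon entropy in bits of P(s|g); higher means less discriminative."""
    return -sum(p * math.log2(p) for p in p_s_given_g if p > 0)


def salience(p_g_given_s):
    """Mean over sentiment pairs of 1 - min/max of P(g|s) (assumed definition)."""
    pairs = list(combinations(p_g_given_s, 2))
    total = 0.0
    for a, b in pairs:
        hi = max(a, b)
        # If the n-gram never occurs in either sentiment, the pair tells us nothing.
        total += 0.0 if hi == 0 else 1 - min(a, b) / hi
    return total / len(pairs)
```

An n-gram spread evenly over the three sentiments gets maximal entropy and zero salience; one confined to a single sentiment gets zero entropy and high salience.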
High Entropy/Salience Examples
Using Entropy and Salience We can set thresholds for entropy and salience. Then we can apply those thresholds in the log-likelihood equation from before, so that only n-grams passing the threshold contribute to the sum.
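One way to read this is as a filter on the log-likelihood sum. In the sketch below, `log_p`, `score`, and `theta` are hypothetical names: `log_p[g]` is log P(g|s) for one sentiment, `score[g]` is the n-gram's salience, and `theta` is the threshold. For entropy, where lower is better, one could pass the negated entropy as the score.

```python
import math


def filtered_log_likelihood(ngrams, log_p, score, theta):
    """Sum log P(g|s) over the n-grams whose score exceeds theta."""
    return sum(log_p[g] for g in ngrams if score[g] > theta)
```

Raising the threshold drops more noisy n-grams, which tends to improve accuracy but leaves more messages with no usable features.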
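Results and Evaluation Use an F-measure (a weighted harmonic mean) to evaluate. Instead of precision and recall, the authors use the terms “accuracy” (% correct guesses) and “decision” (retrieved/all): $F_\beta = (1 + \beta^2)\,\frac{\text{accuracy} \cdot \text{decision}}{\beta^2 \cdot \text{accuracy} + \text{decision}}$ where $\beta = 0.5$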
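A minimal sketch of the score, assuming the standard F-beta form with accuracy in place of precision and decision in place of recall (the function name `f_beta` is hypothetical):

```python
def f_beta(accuracy, decision, beta=0.5):
    """Weighted harmonic mean of accuracy and decision."""
    denom = beta ** 2 * accuracy + decision
    if denom == 0:
        # Both inputs are zero; define the score as zero.
        return 0.0
    return (1 + beta ** 2) * accuracy * decision / denom
```

With beta below 1, accuracy weighs more than decision, so a classifier that answers fewer messages but gets them right is favored.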
Conclusion Emoticons can serve as automatic sentiment labels for a microblogging corpus, so a Naïve Bayes classifier for opinion mining and sentiment analysis can be trained without manual annotation and still reach high accuracy.