Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon
Problem
Semi-supervised sarcasm identification using SASI (Semi-supervised Algorithm for Sarcasm Identification).
Sarcasm: the activity of saying or writing the opposite of what you mean, or of speaking in a way intended to make someone else feel stupid or to show them that you are angry.
Datasets: Twitter
Tweets are 140 characters or fewer.
Tweets can contain URLs, references to other users (@<user>), or hashtags (#<tag>).
Slang, abbreviations, and emoticons are common.
5.9 million tweets; 14.2 words per tweet on average.
18.9% include a URL, 35.3% contain an @<user> reference, and 6.9% contain one or more hashtags.
Datasets: Amazon
66,000 reviews of 120 products; 953 characters on average.
Reviews are usually structured and grammatical.
Reviews have fields including writer, date, rating, and summary.
Amazon reviews carry a great deal of context compared to tweets.
Classification
The algorithm is semi-supervised: it is seeded with a small set of labeled sentences.
Each seed sentence is annotated with a sarcasm ranking in [1, 5].
Syntactic and pattern-based features are used to build a classifier.
Data Preprocessing
Specific information was replaced with general tags to facilitate pattern matching: '[PRODUCT]', '[COMPANY]', '[TITLE]', '[AUTHOR]', '[USER]', '[LINK]', and '[HASHTAG]'.
All HTML tags were removed.
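A minimal sketch of this kind of preprocessing, assuming simple regular expressions: the link, user, and hashtag substitutions can be derived from the text alone, while the product, company, title, and author tags would require the review metadata (not shown here).

```python
import re

def preprocess(text):
    """Replace specific tokens with general tags (a sketch; the paper's exact rules may differ)."""
    text = re.sub(r"<[^>]+>", " ", text)            # strip HTML tags
    text = re.sub(r"https?://\S+", "[LINK]", text)  # URLs -> [LINK]
    text = re.sub(r"@\w+", "[USER]", text)          # @mentions -> [USER]
    text = re.sub(r"#\w+", "[HASHTAG]", text)       # hashtags -> [HASHTAG]
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Loved it!!! thanks @acme http://t.co/x #sarcasm"))
# expected: "Loved it!!! thanks [USER] [LINK] [HASHTAG]"
```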
Pattern Extraction and Selection
Words are classified into high-frequency words (HFWs) and content words (CWs).
A pattern is an ordered sequence of HFWs and slots for CWs, e.g. "[COMPANY] CW does not CW much".
Generated patterns were removed if they appeared both in a seed sentence ranked 1 and in one ranked 5 (such patterns do not discriminate).
Patterns that appeared only in reference to a single product were also removed (too product-specific).
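A sketch of the HFW/CW split and pattern generation. The frequency thresholds (1,000 and 100 occurrences per million words) and the 2-6 HFW / 1-6 CW slot limits follow the pattern-based approach the paper builds on, but they should be treated as assumed defaults here rather than the exact configuration.

```python
from collections import Counter

F_H, F_C = 1000, 100   # per-million thresholds for HFWs / CWs (assumed values)

def word_classes(corpus_tokens):
    """Split the vocabulary into high-frequency words (HFWs) and content words (CWs)."""
    total = len(corpus_tokens)
    per_million = {w: c * 1e6 / total for w, c in Counter(corpus_tokens).items()}
    hfw = {w for w, f in per_million.items() if f > F_H}
    cw = {w for w, f in per_million.items() if f < F_C}
    return hfw, cw

def extract_patterns(tokens, hfw):
    """Emit candidate patterns: contiguous spans containing 2-6 HFWs and 1-6 CW slots."""
    mapped = [t if t in hfw else "CW" for t in tokens]
    patterns = set()
    for i in range(len(mapped)):
        for j in range(i + 2, min(i + 12, len(mapped)) + 1):
            span = mapped[i:j]
            n_hfw = sum(1 for t in span if t != "CW")
            n_cw = len(span) - n_hfw
            if 2 <= n_hfw <= 6 and 1 <= n_cw <= 6:
                patterns.add(" ".join(span))
    return patterns
```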
Pattern Matching
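The slide gives no detail on the matching step itself; the sketch below assumes the scoring scheme described in the source paper: 1 for an exact match, α for a sparse match (extra words inserted between pattern components), γ·n/N for an incomplete match of n out of N components, and 0 otherwise, with α = γ = 0.1. The greedy in-order matching here is a simplification. A pattern is represented as a list of components, e.g. "[COMPANY] CW does not CW much".split().

```python
ALPHA = GAMMA = 0.1   # values reported in the source paper (assumed here)

def components_matched(pattern, mapped):
    """Greedy, in-order count of pattern components found in the mapped sentence."""
    n, i = 0, 0
    for comp in pattern:
        while i < len(mapped) and mapped[i] != comp:
            i += 1
        if i == len(mapped):
            break
        n, i = n + 1, i + 1
    return n

def match_score(pattern, mapped):
    """Feature value of one pattern for one sentence (mapped = tokens with CWs replaced by 'CW')."""
    N = len(pattern)
    for i in range(len(mapped) - N + 1):
        if mapped[i:i + N] == pattern:   # exact match: pattern occurs as a contiguous run
            return 1.0
    n = components_matched(pattern, mapped)
    if n == N:                           # all components present, but with insertions
        return ALPHA
    if n > 1:                            # partial, in-order match
        return GAMMA * n / N
    return 0.0
```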
Other Features
(1) Sentence length in words, (2) number of "!" characters, (3) number of "?" characters, (4) number of quotes, and (5) number of capitalized/all-capitals words in the sentence.
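A small sketch of these sentence-level surface features (the quote count here only covers double quotes; the exact counting rules are assumptions):

```python
def punctuation_features(sentence):
    """Sentence-level features used alongside the pattern features."""
    words = sentence.split()
    return {
        "length_words": len(words),
        "num_exclaim": sentence.count("!"),
        "num_question": sentence.count("?"),
        "num_quotes": sentence.count('"'),
        "num_capitalized": sum(1 for w in words if w[:1].isupper()),
    }

print(punctuation_features('This is the BEST "phone" ever!!!'))
# {'length_words': 6, 'num_exclaim': 3, 'num_question': 0, 'num_quotes': 2, 'num_capitalized': 2}
```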
Data Enrichment
Assumption: sentences adjacent to a sarcastic sentence tend to be sarcastic as well.
Using the seed set for the Amazon data, a Yahoo search was performed for text snippets containing the seed sentences.
The sentences surrounding each hit were added to the training set, annotated with the same ranking as the seed sentence that retrieved them.
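A schematic of this enrichment step. `search_snippets` is a hypothetical stand-in for the web-search service used at the time; it is assumed to return snippets as lists of sentences, and the context window size is an assumption.

```python
def enrich(seed, search_snippets, context_window=1):
    """Expand the labeled seed with sentences that surround seed sentences in web snippets.

    seed: list of (sentence, sarcasm_rank) pairs.
    search_snippets(query): hypothetical search wrapper returning snippets as lists of sentences.
    """
    enriched = list(seed)
    for sentence, rank in seed:
        for snippet in search_snippets(f'"{sentence}"'):
            if sentence in snippet:
                idx = snippet.index(sentence)
                lo = max(0, idx - context_window)
                hi = min(len(snippet), idx + context_window + 1)
                for neighbor in snippet[lo:hi]:
                    if neighbor != sentence:
                        enriched.append((neighbor, rank))   # neighbors inherit the seed label
    return enriched
```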
Classification
The classifier is similar to k-nearest neighbours (kNN): the score for a new instance is the weighted average of the labels of its k nearest training vectors, with distance measured as Euclidean distance in feature space.
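A minimal sketch of this kNN-style scoring. The inverse-distance weighting is an assumption; the slide only says the k nearest vectors are combined by a weighted average under Euclidean distance.

```python
import numpy as np

def knn_score(x, train_X, train_y, k=5):
    """Score a new feature vector as a weighted average of its k nearest labeled vectors."""
    d = np.linalg.norm(train_X - x, axis=1)      # Euclidean distances to all training vectors
    nearest = np.argsort(d)[:k]                  # indices of the k closest vectors
    w = 1.0 / (d[nearest] + 1e-9)                # closer neighbours weigh more (assumed scheme)
    return float(np.dot(w, train_y[nearest]) / w.sum())
```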
Baseline
Assume sarcasm means saying the opposite of what you mean: identify reviews with few stars and decide that sarcasm is present if strongly positive words appear in the review.
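A sketch of that baseline logic. The positive-word lexicon and the star-rating threshold below are illustrative assumptions, not the paper's exact resources.

```python
POSITIVE_WORDS = {"amazing", "brilliant", "fantastic", "perfect", "wonderful"}  # illustrative lexicon

def baseline_is_sarcastic(review_text, star_rating, max_stars=2):
    """Naive baseline: a low star rating plus strongly positive vocabulary => sarcasm."""
    words = {w.strip(".,!?\"'").lower() for w in review_text.split()}
    return star_rating <= max_stars and bool(words & POSITIVE_WORDS)

print(baseline_is_sarcastic("Absolutely brilliant, broke after one day!", star_rating=1))  # True
```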
Training Sets
Amazon: 80 positive and 505 negative seed examples (471 positive / 5,020 negative after enrichment).
Twitter: 1,500 tweets hash-tagged #sarcasm (noisy); this was changed to the positive examples from the Amazon dataset plus manually selected negative examples from the Twitter dataset.
Test Sets
90 positive and 90 negative examples each for Amazon and Twitter.
Only sentences containing a named entity or a reference to a named entity were sampled (more likely to contain sentiment, hence relevant).
Non-sarcastic sentences were drawn only from negative reviews, increasing the chance that they contain negative sentiment.
MTurk was used to create a gold standard for the test set; each sentence was annotated by 3 annotators.
Inter-Annotator Agreement
Amazon: κ = 0.34; Twitter: κ = 0.41.
The higher agreement on Twitter is attributed to the lack of context in the medium.
Results Tables (not reproduced here)
Baseline Intuitions
The baseline has high precision but low recall: it cannot recognize subtly sarcastic sentences.
These results imply that the definition "saying the opposite of what you mean" is not, by itself, a good indicator of sarcasm.
Reasons for Good Twitter Results
Robustness of the sparse and incomplete pattern matching.
SASI learns a model with a feature space spanning over 300 dimensions.
Sarcasm may be easier to detect in tweets because tweeters have to go out of their way to make sarcasm explicit in a medium that provides no context.
Notes
#sarcasm tags were unreliable.
Punctuation marks were the weakest predictors, in contrast to the findings of Tepperman et al. (2006).
The exception is ellipses, which were a strong predictor in combination with other features.