1 Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon

2 Problem Semi-supervised sarcasm identification using SASI (Semi-supervised Algorithm for Sarcasm Identification)
Sarcasm: the activity of saying or writing the opposite of what you mean, or of speaking in a way intended to make someone else feel stupid or show them that you are angry

3 Datasets Twitter Dataset: Tweets are 140 characters or fewer
Tweets can contain URLs, references to other tweeters (@<user>), or hashtags (#<tag>)
Slang, abbreviations, and emoticons are common
5.9 million tweets, 14.2 words per tweet on average
18.9% include a URL, 35.3% reference another tweeter, and 6.9% contain one or more hashtags

4 Datasets Amazon Dataset: 66,000 reviews of 120 products
953 characters on average
Usually structured and grammatical
Have fields including writer, date, rating, and summary
Amazon reviews provide a great deal of context compared to tweets

5 Classification The algorithm is semi-supervised
Seeded with a small group of labeled sentences
Each seed sentence is annotated with a sarcasm ranking in [1, 5]
Syntactic and pattern-based features are used to build a classifier

6 Data Preprocessing Specific information was replaced with general tags to facilitate pattern matching: ‘[PRODUCT]’, ‘[COMPANY]’, ‘[TITLE]’, ‘[AUTHOR]’, ‘[USER]’, ‘[LINK]’, and ‘[HASHTAG]’. All HTML tags were removed.
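
A minimal sketch of the tweet-side substitutions; the regexes below are assumptions (the deck does not show the actual rules), and the Amazon-side tags such as ‘[PRODUCT]’ would additionally require product metadata:

```python
import re

def preprocess_tweet(text: str) -> str:
    """Replace tweet-specific tokens with general tags (assumed regexes)."""
    text = re.sub(r"https?://\S+", "[LINK]", text)   # URLs
    text = re.sub(r"@\w+", "[USER]", text)           # references to other tweeters
    text = re.sub(r"#\w+", "[HASHTAG]", text)        # hashtags
    return text

print(preprocess_tweet("@bob loving my new phone... #not http://t.co/xyz"))
# -> "[USER] loving my new phone... [HASHTAG] [LINK]"
```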

7 Pattern Extraction and Selection
Words are classified into high-frequency words (HFWs) and content words (CWs)
A pattern is an ordered sequence of HFWs and slots for CWs, e.g. “[COMPANY] CW does not CW much”
Generated patterns were removed if they appeared both in a seed ranked 1 and in a seed ranked 5
Patterns appearing only in reference to a single product were also removed
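
A minimal sketch of the HFW/CW split and pattern generation; the per-million frequency thresholds and the pattern-length limit below are assumptions, not values given on this slide:

```python
# Assumed thresholds (per-million corpus frequency); the deck does not give them.
HFW_THRESHOLD = 1000   # above this -> high-frequency word (HFW)
CW_THRESHOLD = 100     # below this -> content word (CW)

def tag_words(tokens, freq_per_million):
    """Label each token as 'HFW', 'CW', or None (ignored)."""
    tagged = []
    for tok in tokens:
        f = freq_per_million.get(tok.lower(), 0)
        if f > HFW_THRESHOLD or tok.startswith("["):   # general tags act like HFWs
            tagged.append((tok, "HFW"))
        elif f < CW_THRESHOLD:
            tagged.append((tok, "CW"))
        else:
            tagged.append((tok, None))
    return tagged

def generate_patterns(tagged, max_len=6):
    """Yield ordered sequences of HFWs with CW slots, e.g. '[COMPANY] CW does not CW much'."""
    seq = [w if kind == "HFW" else "CW" for w, kind in tagged if kind]
    for i in range(len(seq)):
        for j in range(i + 2, min(i + max_len, len(seq)) + 1):
            window = seq[i:j]
            if sum(1 for w in window if w != "CW") >= 2:   # require at least two HFWs
                yield " ".join(window)
```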

8 Pattern Matching
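
The body of this slide is not in the transcript. As a rough sketch of how a pattern is scored against a sentence, following my reading of the SASI paper: a full match scores 1 and a partial match gets fractional credit. The 0.1 weight and the simplified in-order matching below are assumptions, and the paper's separate "sparse match" case is collapsed into the full-match case:

```python
GAMMA = 0.1  # assumed weight for incomplete matches

def match_score(pattern_words, sentence_words):
    """Score a pattern (list of HFWs and 'CW' slots) against a tokenized sentence."""
    n, N = 0, len(pattern_words)
    it = iter(sentence_words)
    for p in pattern_words:
        for s in it:                       # match pattern components in order
            if p == "CW" or p.lower() == s.lower():
                n += 1
                break
    if n == N:
        return 1.0                         # all components matched
    if n > 0:
        return GAMMA * n / N               # incomplete match: partial credit
    return 0.0                             # no match

print(match_score("[COMPANY] CW does not CW much".split(),
                  "[COMPANY] phone does not work much at all".split()))  # -> 1.0
```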

9 Other Features (1) Sentence length in words
(2) Number of “!” characters in the sentence
(3) Number of “?” characters in the sentence
(4) Number of quotes in the sentence
(5) Number of capitalized/all-capitals words in the sentence
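
A small sketch computing the five surface features listed above; the quote-counting and capitalization heuristics are mine:

```python
def surface_features(sentence: str) -> dict:
    words = sentence.split()
    return {
        "length_words": len(words),                              # (1) sentence length in words
        "exclamations": sentence.count("!"),                     # (2) '!' characters
        "questions": sentence.count("?"),                        # (3) '?' characters
        "quotes": sentence.count('"') // 2,                      # (4) paired double quotes
        "capitalized": sum(1 for w in words if w[:1].isupper()), # (5) capitalized/all-caps words
    }

print(surface_features('Great, just GREAT! Another "bargain"?'))
```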

10 Data Enrichment Assumption: sentences near a sarcastic sentence are likely to be sarcastic as well
Using the seed set for the Amazon data, perform a Yahoo search for text snippets containing the seed sentences
Include the surrounding sentences from each snippet in the training set, annotated with the same sarcasm rank as the seed that retrieved them
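
A sketch of the enrichment loop, assuming a hypothetical snippet-search helper (stubbed below; no particular search API is implied by the deck):

```python
def web_search_snippets(query: str):
    """Hypothetical stand-in for a web search client; returns lists of snippet sentences."""
    return []  # stub: a real implementation would query a search engine

def enrich(seed):
    """seed: list of (sentence, sarcasm_rank) pairs; returns the seed plus enriched examples."""
    enriched = []
    for sentence, rank in seed:
        for snippet_sentences in web_search_snippets(sentence):
            for neighbor in snippet_sentences:        # sentences surrounding the hit
                enriched.append((neighbor, rank))     # inherit the seed's sarcasm rank
    return seed + enriched
```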

11 Classification Similar to kNN
The score for a new instance is the weighted average of the scores of its k nearest training-set vectors, using Euclidean distance
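
A minimal sketch of the scoring step; the inverse-distance weighting is an assumption (the slide only says "weighted average"):

```python
import numpy as np

def knn_score(x, train_X, train_y, k=5, eps=1e-8):
    """Weighted average of the labels of the k nearest training vectors (Euclidean distance)."""
    d = np.linalg.norm(train_X - x, axis=1)   # distance to every training vector
    nearest = np.argsort(d)[:k]               # indices of the k closest
    w = 1.0 / (d[nearest] + eps)              # closer neighbours get larger weights
    return float(np.dot(w, train_y[nearest]) / w.sum())
```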

12 Baseline Assume sarcasm implies saying the opposite of what you mean
Identify reviews with few stars and decide that sarcasm is present if strongly positive words appear in the review
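
A sketch of that baseline; the star-rating cutoff and the positive-word list are illustrative only:

```python
STRONG_POSITIVE = {"amazing", "brilliant", "fantastic", "perfect", "best", "love", "great"}

def baseline_is_sarcastic(review_text: str, stars: int) -> bool:
    """Low star rating + strongly positive wording -> flag as sarcastic."""
    words = {w.strip('.,!?"').lower() for w in review_text.split()}
    return stars <= 2 and bool(words & STRONG_POSITIVE)

print(baseline_is_sarcastic("Best purchase ever. Broke after one day.", 1))  # True
```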

13 Training Sets Amazon: 80 positive and 505 negative examples (471 positive / 5,020 negative after data enrichment)
Twitter: 1,500 tweets hash-tagged #sarcasm (noisy)
Later replaced with positive examples from the Amazon dataset and manually selected negative examples from the Twitter dataset

14 Test Sets 90 positive and 90 negative examples each for Amazon and Twitter
Only sentences containing a named entity or a reference to one were sampled (more likely to contain sentiment → relevant)
Non-sarcastic sentences were drawn only from negative reviews, increasing the chance that they contain negative sentiment
MTurk was used to create a gold standard for the test set; each sentence was annotated by 3 annotators

15 Inter-Annotator Agreement
Amazon: κ = 0.34
Twitter: κ = 0.41
The higher agreement on Twitter is attributed to the lack of context in the medium, which pushes tweeters to make sarcasm explicit
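
The deck does not say which agreement statistic was used; assuming Fleiss' kappa for the three-annotator setup, a minimal computation looks like:

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: items x categories matrix of how many annotators chose each label per item."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                      # raters per item (assumed constant)
    p_j = counts.sum(axis=0) / counts.sum()        # overall proportion of each category
    P_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# e.g. 3 annotators labelling 4 sentences as sarcastic / not sarcastic:
print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))  # ~0.33
```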

16 Tables

17 Baseline Intuitions The baseline has high precision but low recall
It cannot recognize subtly sarcastic sentences
These results imply that the definition “saying the opposite of what you mean” is not a good indicator of sarcasm

18 Reasons for Good Twitter Results
Robustness of sparse and incomplete pattern matching
SASI learns a model with a feature space spanning over 300 dimensions
Sarcasm may be easier to detect in tweets because tweeters have to go out of their way to make sarcasm explicit in a medium with no context

19 Notes #sarcasm tags were unreliable
Punctuation marks were the weakest predictors, in contrast to the findings of Tepperman et al. (2006)
The exception is the use of ellipses, which was a strong predictor in combination with other features

