
1 Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon - Smit Shilu

2 Problem
Semi-supervised identification of sarcasm in datasets from popular sites such as Twitter and Amazon.
What is sarcasm? The activity of saying or writing the opposite of what you mean, or of speaking in a way intended to make someone else feel stupid or to show them that you are angry.
Example: “Wow GPRS data speeds are blazing fast.” (Twitter)

3 Datasets
Twitter dataset:
Tweets contain at most 140 characters
Tweets may contain URLs, references to other tweeters (@), or hashtags (#)
Slang, abbreviations, and emoticons are common
5.9 million tweets
14.2 words per tweet on average
18.7% include a URL, 35.3% contain @, 6.9% contain at least one hashtag

4 Datasets
Amazon dataset:
66,000 reviews for 120 products, including books and electronics
953 characters per review on average
Usually structured and grammatical
Reviews have fields including writer, date, rating, and summary
Amazon reviews have a great deal of context compared to tweets

5 Classification
The algorithm is semi-supervised
It is seeded with a small group of labeled sentences
The seed is annotated with a sarcasm ranking on a scale of 1 to 5
Syntactic and pattern-based features are used to build a classifier

6 Data Preprocessing
Specific information is replaced with general tags to facilitate pattern matching: ‘[PRODUCT]’, ‘[COMPANY]’, ‘[TITLE]’, ‘[AUTHOR]’, ‘[USER]’, ‘[LINK]’, and ‘[HASHTAG]’
All HTML tags are removed
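The slide does not show the authors' implementation; below is a minimal Python sketch of this step, assuming simple regular-expression substitutions. The `products` and `companies` parameters are hypothetical illustrations of how named entities could be supplied.

```python
import re

# A minimal sketch of the tag-replacement preprocessing step.
# The regexes and entity lists are illustrative assumptions,
# not the authors' implementation.
def preprocess(text, products=(), companies=()):
    text = re.sub(r"<[^>]+>", "", text)             # strip HTML tags
    text = re.sub(r"https?://\S+", "[LINK]", text)  # URLs -> [LINK]
    text = re.sub(r"@\w+", "[USER]", text)          # @mentions -> [USER]
    text = re.sub(r"#\w+", "[HASHTAG]", text)       # hashtags -> [HASHTAG]
    for name in products:
        text = text.replace(name, "[PRODUCT]")
    for name in companies:
        text = text.replace(name, "[COMPANY]")
    return text

print(preprocess("Wow, the Kindle from http://x.co is great @bob #happy",
                 products=["Kindle"]))
```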

7 Pattern Extraction and Selection
Words are classified into high-frequency words (HFWs) and content words (CWs); see the sketch after this list
A pattern is an ordered sequence of HFWs with slots for CWs
Patterns that appear only in reference to a single product were removed
Generated patterns were also removed if they were present in two seed sentences with the opposite rankings 1 and 5, since such patterns do not discriminate sarcastic from non-sarcastic text
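As a rough illustration of the HFW/CW split, here is a sketch that classifies words by per-million corpus frequency. The thresholds `F_HFW` and `F_CW` are illustrative assumptions, not necessarily the paper's exact values.

```python
from collections import Counter

# Illustrative per-million frequency thresholds (assumed values).
F_HFW = 1000   # at or above -> high-frequency word (HFW)
F_CW = 100     # at or below -> content word (CW)

def classify_words(corpus_tokens):
    """Label each word in the corpus as HFW, CW, or neither."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    kinds = {}
    for word, n in counts.items():
        per_million = n * 1_000_000 / total
        if per_million >= F_HFW:
            kinds[word] = "HFW"
        elif per_million <= F_CW:
            kinds[word] = "CW"
        else:
            kinds[word] = "OTHER"  # excluded from patterns in this sketch
    return kinds
```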

8 Pattern Matching
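The matching scheme itself is not reproduced in this transcript. The paper describes per-pattern feature values for exact, sparse, and incomplete matches; the following is a simplified greedy sketch of that scoring, with `ALPHA` and `GAMMA` set to 0.1 as assumed constants.

```python
# Simplified greedy sketch of pattern-match scoring: 1 for an exact match,
# ALPHA for a sparse match (all components in order with extra words between
# them), GAMMA * n / N for an incomplete match of n of N components, else 0.
ALPHA = GAMMA = 0.1

def match_score(pattern, sentence_words):
    """pattern: list of HFWs and the placeholder 'CW'; sentence_words: tokens."""
    i, gaps = 0, False
    for w in sentence_words:
        if i == len(pattern):
            break
        if pattern[i] == "CW" or w == pattern[i]:
            i += 1           # matched the next pattern component
        elif i > 0:
            gaps = True      # extra word between matched components
    if i == len(pattern):
        return ALPHA if gaps else 1.0
    return GAMMA * i / len(pattern) if i else 0.0

print(match_score(["wow", "CW", "is", "CW"],
                  "wow this is great".split()))   # exact match -> 1.0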

9 Punctuation-based Features
Sentence length in words
Number of “!” characters in the sentence
Number of “?” characters in the sentence
Number of quotes in the sentence
Number of capitalized/all-capitals words in the sentence
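These features are straightforward to compute. A minimal sketch follows; counting only double quotes is an assumed simplification, since apostrophes in contractions would need more care.

```python
# Minimal sketch of the punctuation-based features listed above.
def punctuation_features(sentence):
    words = sentence.split()
    return {
        "length_in_words": len(words),
        "exclamations": sentence.count("!"),
        "question_marks": sentence.count("?"),
        "quotes": sentence.count('"'),  # double quotes only (simplification)
        "capitalized_words": sum(w[:1].isupper() for w in words),
        "all_caps_words": sum(w.isupper() for w in words),
    }

print(punctuation_features('Wow, GPRS data speeds are "blazing" fast!!'))
```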

10 Data Enrichment
Assumption: sarcastic sentences frequently co-appear in text with other sarcastic sentences
The authors performed an automated web search using the Yahoo! BOSS API, composing a search engine query from each sarcastic sentence in the training set
Each newly extracted sentence was assigned the same label as the sentence used for the query
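A sketch of the enrichment loop follows. The Yahoo! BOSS API has long been retired, so `web_search` and `split_into_sentences` below are hypothetical stand-ins (stubbed so the sketch runs), not real API calls.

```python
def web_search(query, max_results=50):
    """Hypothetical stand-in for the retired Yahoo! BOSS search API;
    a real implementation would return text snippets matching the query."""
    return []

def split_into_sentences(snippet):
    """Hypothetical, naive sentence splitter."""
    return [s.strip() for s in snippet.split(".") if s.strip()]

def enrich(seed_sentences):
    """seed_sentences: list of (sentence, sarcasm_label) pairs."""
    enriched = []
    for sentence, label in seed_sentences:
        for snippet in web_search(f'"{sentence}"'):
            for new_sentence in split_into_sentences(snippet):
                # Per the slide's assumption, co-occurring sentences
                # inherit the label of the query sentence.
                enriched.append((new_sentence, label))
    return enriched
```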

11 Classification
Similar to kNN (k-nearest neighbours)
The score for a new instance is the weighted average of the k nearest training set vectors, measured using Euclidean distance
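A minimal sketch of this scoring, assuming inverse-distance weighting (the slide does not specify the exact weights the authors used):

```python
import math

# Score a new feature vector as the weighted average of the labels of its
# k nearest training vectors under Euclidean distance.
def knn_score(x, training, k=5):
    """x: feature vector; training: list of (vector, label) pairs."""
    nearest = sorted((math.dist(x, v), label) for v, label in training)[:k]
    # Inverse-distance weighting is an assumption here.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    labels = [label for _, label in nearest]
    return sum(w * y for w, y in zip(weights, labels)) / sum(weights)
```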

12 Training Sets
Amazon: 80 positive and 505 negative examples, expanded by data enrichment to 471 positive / 5,020 negative
Twitter: 1,500 tweets hashtagged #sarcasm (noisy and biased)

13 Star-Sentiment Baseline
A baseline implemented to capture the notion of sarcasm, trying to meet the definition stated earlier
Identify reviews with a low star rating and classify those sentences within them that exhibit strong positive sentiment as sarcastic
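A sketch of this baseline under stated assumptions: ratings of one or two stars count as negative, and "strong positive sentiment" is approximated with a tiny illustrative lexicon.

```python
# Illustrative positive-sentiment lexicon; a real system would use a proper
# sentiment resource.
POSITIVE_WORDS = {"great", "amazing", "excellent", "love", "perfect", "best"}

def star_sentiment_baseline(review_text, star_rating, max_stars=2):
    """Flag strongly positive sentences inside low-star (negative) reviews."""
    if star_rating > max_stars:               # only unhappy reviewers
        return []
    sarcastic = []
    for sentence in review_text.split("."):
        words = set(sentence.lower().split())
        if len(words & POSITIVE_WORDS) >= 2:  # crude "strong" positive signal
            sarcastic.append(sentence.strip())
    return sarcastic
```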

14 Test Sets
90 positive and 90 negative examples each for Amazon and Twitter
Only sentences containing a named entity or a named-entity reference were sampled
Non-sarcastic sentences were taken only from negative reviews, increasing the chance that they contain negative sentiment
Mechanical Turk was used to create a gold standard for the test set; each sentence was annotated by 3 annotators

15 Inter-Annotator Agreement
Amazon: κ = 0.34
Twitter: κ = 0.41
The higher agreement on Twitter is attributed to the lack of context in the medium
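For reference, here is a sketch of Fleiss' kappa, the usual agreement statistic for more than two raters; whether the paper used exactly this variant is an assumption.

```python
# Sketch of Fleiss' kappa for multi-rater agreement (assumed variant).
def fleiss_kappa(ratings):
    """ratings: per-item category counts, e.g. [[3, 0], [2, 1], ...],
    where each inner list sums to the number of raters per item."""
    n = len(ratings)                          # number of items
    r = sum(ratings[0])                       # raters per item
    c = len(ratings[0])                       # number of categories
    p_j = [sum(item[j] for item in ratings) / (n * r) for j in range(c)]
    P_i = [(sum(x * x for x in item) - r) / (r * (r - 1)) for item in ratings]
    P_bar = sum(P_i) / n                      # observed agreement
    P_e = sum(p * p for p in p_j)             # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Example: 4 sentences, 3 annotators, categories (sarcastic, not sarcastic).
print(fleiss_kappa([[3, 0], [2, 1], [1, 2], [0, 3]]))  # ~0.333
```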

16 Tables

17 Conclusion
SASI exhibits the best overall performance, with 91.2% precision and an F-score of 0.72
In the second experiment, based on the gold standard annotation, SASI had a precision of 0.766, a significant improvement over the baseline (0.5)
Results on the Twitter dataset were better than those obtained on the Amazon dataset

18 Questions
The paper exploits the metadata provided by Amazon, namely the star rating each reviewer is obliged to provide, in order to identify unhappy reviewers; from this set of negative reviews, the baseline classifies as sarcastic those sentences that exhibit strong positive sentiment. Does the paper check for negation of such strong positive sentiments?
What can be the reason behind the Twitter dataset producing better results than the Amazon dataset, even though tweets are less structured and context-free?

19 Questions
Since the researchers performed pattern selection, pattern matching, and data enrichment only on the Amazon dataset, can we perform them on the Twitter dataset as well?
What are some other features that could be used to identify sarcasm in a tweet?
Since sarcasm is very subjective and context-based, what other features do you think could help understand and identify the context better (thus improving sarcasm identification)?

20 Questions
Would SASI capture more underlying sarcastic features on the Amazon dataset if it hadn't been restricted so much? (On Twitter, where the restrictions were fewer, it seemed to perform better.)
How will a larger k value (k: number of closest vectors) in kNN classification affect the computation and the classification? (For example, computation: time complexity; classification: sarcasm detection.)
During pattern selection, why were clearly sarcastic sentences removed from the analysis? Wouldn't they help in better classification?

