Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon - Smit Shilu

Presentation transcript:

Semi-Supervised Recognition of Sarcastic Sentences in Twitter and Amazon (Dmitry Davidov, Oren Tsur, Ari Rappoport) - presented by Smit Shilu

Problem
Semi-supervised identification of sarcasm in datasets from popular sites such as Twitter and Amazon.
What is sarcasm? The activity of saying or writing the opposite of what you mean, or of speaking in a way intended to make someone else feel stupid or show them that you are angry.
Example: "Wow GPRS data speeds are blazing fast." (Twitter)

Datasets
Twitter Dataset:
Tweets contain at most 140 characters.
Tweets may contain URLs, references to other tweeters (@user), or hashtags (#tag).
Slang, abbreviations, and emoticons are common.
5.9 million tweets; 14.2 words per tweet on average.
18.7% include a URL, 35.3% contain a reference to another user, and 6.9% contain at least one hashtag.

Datasets
Amazon Dataset:
66,000 reviews for 120 products, including books and electronics.
953 characters per review on average.
Reviews are usually structured and grammatical.
Reviews have fields including writer, date, rating, and summary.
Amazon reviews carry a great deal of context compared to tweets.

Classification
The algorithm is semi-supervised.
It is seeded with a small group of labeled sentences.
The seed is annotated with a sarcasm ranking on a scale of 1 to 5.
Syntactic and pattern-based features are used to build a classifier.

Data Preprocessing
Specific information is replaced with general tags to facilitate pattern matching: '[PRODUCT]', '[COMPANY]', '[TITLE]', '[AUTHOR]', '[USER]', '[LINK]', and '[HASHTAG]'.
All HTML tags are removed.
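A minimal sketch of this preprocessing step, assuming simple regular expressions and known product/company name lists (the `preprocess` helper and its patterns are illustrative, not the authors' code):

```python
import re

def preprocess(text, product_names=(), company_names=()):
    """Replace review/tweet-specific tokens with general tags and strip HTML."""
    text = re.sub(r"<[^>]+>", " ", text)               # remove HTML tags
    text = re.sub(r"https?://\S+", "[LINK]", text)     # URLs -> [LINK]
    text = re.sub(r"@\w+", "[USER]", text)             # @mentions -> [USER]
    text = re.sub(r"#\w+", "[HASHTAG]", text)          # hashtags -> [HASHTAG]
    for name in product_names:                         # known product names -> [PRODUCT]
        text = re.sub(re.escape(name), "[PRODUCT]", text, flags=re.IGNORECASE)
    for name in company_names:                         # known company names -> [COMPANY]
        text = re.sub(re.escape(name), "[COMPANY]", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Wow @bob, GPRS on my Kindle is blazing fast! http://t.co/x #3G",
                 product_names=["Kindle"]))
# -> "Wow [USER], GPRS on my [PRODUCT] is blazing fast! [LINK] [HASHTAG]"
```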

Pattern Extraction and Selection
Words are classified into high-frequency words (HFWs) and content words (CWs).
A pattern is an ordered sequence of HFWs with slots for CWs.
Patterns that appear only in reference to a single product were removed.
Generated patterns were also removed if they appeared in two seed sentences with rankings 1 and 5 (i.e., in both a clearly non-sarcastic and a clearly sarcastic seed).
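A rough sketch of the HFW/CW split and pattern generation. The frequency thresholds and the limit of up to six elements per pattern are my recollection of the underlying Davidov et al. approach, so treat them (and the simplified windowing in `candidate_patterns`) as assumptions:

```python
from collections import Counter

FH = 1000  # assumed HFW threshold: occurrences per million corpus words
FC = 100   # assumed CW threshold: occurrences per million corpus words

def word_classes(corpus_tokens):
    """Split the vocabulary into high-frequency words (HFWs) and content words (CWs)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    per_million = {w: c * 1_000_000 / total for w, c in counts.items()}
    hfw = {w for w, f in per_million.items() if f > FH}
    cw = {w for w, f in per_million.items() if f < FC}
    return hfw, cw

def candidate_patterns(tokens, hfw, cw, max_len=6):
    """Ordered sequences of HFWs and 'CW' slots drawn from one sentence."""
    skeleton = [w if w in hfw else "CW" for w in tokens if w in hfw or w in cw]
    patterns = set()
    for i in range(len(skeleton)):
        for j in range(i + 2, min(i + max_len, len(skeleton)) + 1):
            window = tuple(skeleton[i:j])
            if any(w != "CW" for w in window):  # require at least one HFW per pattern
                patterns.add(window)
    return patterns
```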

Pattern Matching
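The transcript carries only the slide title here. As I recall the paper, each selected pattern becomes a classification feature whose value reflects how well it matches the sentence: 1 for an exact match, a small constant for a "sparse" match with extra words inserted, a partial score for an incomplete match, and 0 otherwise. The sketch below, including the α and γ defaults, is therefore an assumption rather than something stated on the slide:

```python
ALPHA = 0.1  # assumed value for a sparse match
GAMMA = 0.1  # assumed weight for an incomplete match

def pattern_feature(pattern, skeleton):
    """Feature value of one pattern for one sentence skeleton (sequence of HFWs / 'CW' slots)."""
    N = len(pattern)
    # exact, contiguous, in-order match
    for i in range(len(skeleton) - N + 1):
        if tuple(skeleton[i:i + N]) == tuple(pattern):
            return 1.0
    # greedy in-order subsequence match (rough approximation of sparse/incomplete matching)
    idx = 0
    for tok in skeleton:
        if idx < N and tok == pattern[idx]:
            idx += 1
    if idx == N:
        return ALPHA                # all components present, but with extra words in between
    if idx > 0:
        return GAMMA * idx / N      # only idx of the N components matched
    return 0.0
```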

Punctuation-based Features
Sentence length in words.
Number of "!" characters in the sentence.
Number of "?" characters in the sentence.
Number of quotes in the sentence.
Number of capitalized / all-capitals words in the sentence.
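A direct sketch of these surface features (the feature names are mine):

```python
def punctuation_features(sentence):
    """Punctuation-based feature vector for a single sentence."""
    words = sentence.split()
    return {
        "length_in_words": len(words),
        "exclamation_marks": sentence.count("!"),
        "question_marks": sentence.count("?"),
        "quotes": sentence.count('"'),
        "capitalized_words": sum(1 for w in words if w[:1].isupper()),
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
    }

print(punctuation_features('Wow, GPRS data speeds are "BLAZING" fast!!!'))
```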

Data Enrichment
Assumption: sarcastic sentences frequently co-appear in text with other sarcastic sentences.
The authors performed automated web searches using the Yahoo! BOSS API, composing each search-engine query from a sarcastic sentence in the training set.
Each newly extracted sentence was given a label similar to that of the sentence used for the query.
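A sketch of the enrichment loop. Yahoo! BOSS has since been shut down, so `web_search` below is a hypothetical stand-in for whatever search backend is available; the propagation rule (a retrieved sentence inherits the label of the seed sentence used as the query) follows the slide:

```python
def web_search(query, max_results=50):
    """Hypothetical search backend: returns snippet sentences found for `query`."""
    raise NotImplementedError("plug in a real search API here")

def enrich(seed_labels):
    """Expand a labelled seed: sentences retrieved for a seed sentence get its label."""
    enriched = dict(seed_labels)             # sentence -> sarcasm label (1-5)
    for sentence, label in seed_labels.items():
        for hit in web_search(sentence):
            enriched.setdefault(hit, label)  # keep an existing label if one is already present
    return enriched
```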

Classification
Similar to kNN (k-nearest neighbours).
The score for a new instance is the weighted average of the k nearest training-set vectors, measured using Euclidean distance.
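A sketch of this scoring rule. The slide does not give k or the exact weighting, so k = 5 and the inverse-distance weights are assumptions:

```python
import numpy as np

def knn_score(x, train_X, train_y, k=5):
    """Weighted average label of the k training vectors closest to x in Euclidean distance."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-8)   # closer vectors get larger weights (assumed scheme)
    return float(np.dot(weights, train_y[nearest]) / weights.sum())

# Usage: a score above some threshold on the 1-5 sarcasm scale (e.g. 3) marks the sentence as sarcastic.
```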

Training Sets
Amazon: 80 positive and 505 negative seed examples, expanded by data enrichment to 471 positive and 5,020 negative examples.
Twitter: 1,500 tweets carrying the #sarcasm hashtag; this labeling is noisy and biased.

Star-Sentiment Baseline
A baseline implemented to capture the notion of sarcasm, trying to meet the definition stated earlier.
Identify reviews with a low star rating (unhappy reviewers) and classify sentences in them that exhibit strong positive sentiment as sarcastic.
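A minimal sketch of this baseline; the star-rating threshold and the toy positive-word lexicon are illustrative assumptions:

```python
STRONG_POSITIVE = {"great", "amazing", "wonderful", "best", "love", "perfect"}  # toy lexicon

def star_sentiment_baseline(review_text, star_rating, low_star=2):
    """Inside low-star (unhappy) reviews, flag strongly positive sentences as sarcastic."""
    if star_rating > low_star:
        return []
    sarcastic = []
    for sentence in review_text.split("."):
        tokens = [w.strip('",!?()').lower() for w in sentence.split()]
        if any(w in STRONG_POSITIVE for w in tokens):
            sarcastic.append(sentence.strip())
    return sarcastic

print(star_sentiment_baseline("Best purchase ever. Broke after two days.", star_rating=1))
```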

Test Sets
90 positive and 90 negative examples each for Amazon and Twitter.
Only sentences containing a named entity or a reference to a named entity were sampled.
Non-sarcastic sentences were taken only from negative reviews, increasing the chance that they contain negative sentiment.
Mechanical Turk was used to create a gold standard for the test set; each sentence was annotated by 3 annotators.

Inter-Annotator Agreement
Amazon: κ = 0.34
Twitter: κ = 0.41
The higher agreement on Twitter is attributed to the lack of context in the medium.
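With three annotators per sentence, the reported agreement is presumably a multi-rater statistic such as Fleiss' κ (which variant was used is an assumption on my part); a sketch of that computation for two categories (sarcastic / not sarcastic):

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa. `ratings` is an (items x categories) matrix of annotator counts;
    a row [2, 1] means 2 of 3 annotators marked the sentence sarcastic, 1 did not."""
    ratings = np.asarray(ratings, dtype=float)
    n_items = ratings.shape[0]
    n_raters = ratings[0].sum()
    p_j = ratings.sum(axis=0) / (n_items * n_raters)          # overall category proportions
    P_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()              # observed vs. chance agreement
    return (P_bar - P_e) / (1 - P_e)

print(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]))  # toy example: 4 sentences, 3 annotators
```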

Tables

Conclusion
SASI exhibits the best overall performance, with 91.2% precision and an F-score of 0.72.
In the second experiment, based on the gold-standard annotation, SASI's precision was again a significant improvement over the baseline (0.5).
Results on the Twitter dataset were better than those obtained on the Amazon dataset.

Questions
The paper exploits the metadata provided by Amazon, namely the star rating each reviewer is obliged to provide, in order to identify unhappy reviewers. From this set of negative reviews, the baseline classifies as sarcastic those sentences that exhibit strong positive sentiment. Does the paper check for negation of such strong positive sentiments?
What could be the reason for the Twitter dataset producing better results than the Amazon dataset, even though tweets are less structured and context-free?

Questions
The researchers performed pattern selection, pattern matching, and data enrichment only on the Amazon dataset. Can we perform pattern selection, pattern matching, and data enrichment on the Twitter dataset as well?
What are some other features that could be used to identify sarcasm in a tweet?
Since sarcasm is very subjective and context-dependent, what other features do you think could help understand and identify the context better (and thus improve sarcasm identification)?

Questions
Would SASI capture more underlying sarcastic features on the Amazon dataset if it had not been restricted so much? (On Twitter, where the restrictions were fewer, it seemed to perform better.)
How would a larger k value (k: number of closest vectors) in kNN classification affect computation and classification (e.g., time complexity for computation, and sarcasm detection quality for classification)?
During pattern selection, why were the clearly sarcastic tweets removed from the analysis? Wouldn't they help achieve better classification?