Peiti Li 1, Shan Wu 2, Xiaoli Chen 1 1 Computer Science Dept. 2 Statistics Dept. Columbia University 116th Street and Broadway, New York, NY 10027, USA.


Introducing Movie Review

Why? It is a fast and more direct way for people to share their opinions on a topic.

Python Twitter Search API + Stream API

Opinion Mining or Sentiment Analysis: the computational study of opinions, sentiments, subjectivity, and attitudes.

Just like a text classification task, but different from topic-based text classification. In topic-based text classification (e.g., computer, sport, science), topic words are important. In sentiment classification, opinion/sentiment words are more important, e.g., awesome, great, excellent, horrible, bad, worst.

Why a HARD task? Structure the unstructured: natural language text is often regarded as unstructured data. Besides data mining, we need NLP technologies.
Example: "I bought an iPhone a few days ago. It is such a nice phone. The touch screen is really cool. The voice quality is clear too. It is much better than my old Blackberry, which was a terrible phone and so difficult to type with its tiny keys. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive,…"
(Credits: Bing Liu for this example.)

Tell people whether to buy a movie ticket, using tweets. Classify each tweet as either positive or negative. Give a rating of the movie based on the tweets.

Accuracies of different machine learning approaches. Table from: Bo Pang et al., "Thumbs up? Sentiment Classification using Machine Learning Techniques." In Proc. of the ACL, Association for Computational Linguistics.

Our approach is Naïve Bayes:
P(sentiment | sentence) = P(sentiment) P(sentence | sentiment) / P(sentence)
Smoothing: P(token | sentiment) = (count(this token in class) + 1) / (count(all tokens in class) + count(all tokens))
(In standard add-one smoothing, the last term of the denominator is the number of distinct tokens, i.e., the vocabulary size.)
We didn’t use any third-party classifier; we coded our classifier all by ourselves. Reason: we wanted to explore what is under the hood and to tune the algorithm structure according to the experiment results.
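The Bayes rule and smoothing formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual classifier: the class name `NaiveBayes`, the whitespace tokenization, the toy training sentences, and the reading of "count(all tokens)" as vocabulary size are all assumptions. P(sentence) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)        # class -> number of sentences
        self.token_counts = defaultdict(Counter)   # class -> token -> count
        for sentence, label in zip(sentences, labels):
            self.token_counts[label].update(sentence.lower().split())
        # Assumption: "count(all tokens)" means the number of distinct tokens.
        self.vocab = {t for counts in self.token_counts.values() for t in counts}
        return self

    def log_scores(self, sentence):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)            # log P(sentiment)
            for token in sentence.lower().split():
                if token in self.vocab:            # tokens never seen in training are skipped
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, sentence):
        scores = self.log_scores(sentence)
        return max(scores, key=scores.get)


# Toy training data, invented for this example.
train = [
    ("a great excellent movie", "pos"),
    ("awesome and great fun", "pos"),
    ("horrible bad plot", "neg"),
    ("the worst bad acting", "neg"),
]
clf = NaiveBayes().fit([s for s, _ in train], [y for _, y in train])
print(clf.predict("great fun"))       # pos
print(clf.predict("worst bad plot"))  # neg
```

Working in log space avoids numeric underflow when many small token probabilities are multiplied.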

Getting Started.

Dataset
» Dev set: the movie review dataset provided by Bo Pang and Lillian Lee, Cornell University: sentence_polarity_dataset_v positive, 5331 negative
» Real set: tweets about a specific movie (cannot tell the exact number)
Twitter Search API (REST): last 6-7 days
Twitter Stream API: real-time timeline
(Drawbacks: the REST API has rate limiting; Stream data takes time to collect.)

Top 100 words including stopwords

Better and better, but…
Baseline model: Naïve Bayes without any nontrivial text preprocessing; punctuation excluded, stopwords included.
Tuned model: still Naïve Bayes, with a better feature extraction technique: eliminating low-information features.
Variants tried: best unigram model; best unigram and bigram model.
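One way to eliminate low-information features is to score every word by how unevenly it is distributed between the two classes and keep only the top-scoring words. The slides do not say which score was used, so the chi-square statistic below is an assumption, and the function names and toy sentences are hypothetical.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def high_information_words(pos_sentences, neg_sentences, top_n):
    """Return the top_n words most strongly associated with either class."""
    pos = Counter(t for s in pos_sentences for t in s.lower().split())
    neg = Counter(t for s in neg_sentences for t in s.lower().split())
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    scores = {
        # Rows: this word vs. all other words; columns: positive vs. negative class.
        w: chi_square(pos[w], neg[w], n_pos - pos[w], n_neg - neg[w])
        for w in set(pos) | set(neg)
    }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return set(ranked[:top_n])


# Toy data, invented for this example.
pos_sentences = ["the movie is great", "the plot is great"]
neg_sentences = ["the movie is bad", "the plot is bad"]
best = high_information_words(pos_sentences, neg_sentences, top_n=2)
print(sorted(best))  # ['bad', 'great']
```

Words such as "the" appear equally often in both classes, score zero, and are dropped, which is the intended effect of this kind of filtering.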

Dev set result (Trainset 5000, Testset 331):

Model                                            Recall   Specificity  Accuracy
Baseline                                         76.13%   82.78%       79.46%
Baseline, stopwords removed                      75.83%   79.46%       77.64%
Best unigram, stopwords not removed              83.99%   85.20%       84.60%
Best unigram, stopwords removed                  82.78%   85.80%       84.29%
Best unigram and bigram, stopwords not removed   N/A      N/A          78.24%

Takes 1 hour! An Intel Core i5 laptop died in the middle of the run because it was too hot for too long.
Observation: we should definitely not use bigrams, but we still don’t know whether we should remove the stopwords.
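For reference, the three metrics reported for the dev set can be computed from confusion-matrix counts as sketched below. The counts in the usage example are hypothetical: they were back-derived from the baseline percentages under the assumption that the test set holds 331 positive and 331 negative sentences, which the slides do not state explicitly.

```python
def recall_specificity_accuracy(tp, fn, tn, fp):
    """Return (recall, specificity, accuracy) from confusion-matrix counts."""
    recall = tp / (tp + fn)                      # share of positives found
    specificity = tn / (tn + fp)                 # share of negatives found
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # share of all items classified correctly
    return recall, specificity, accuracy


# Hypothetical counts that reproduce the baseline figures (assumed 331 per class).
recall, specificity, accuracy = recall_specificity_accuracy(tp=252, fn=79, tn=274, fp=57)
print(f"{recall:.2%} {specificity:.2%} {accuracy:.2%}")  # 76.13% 82.78% 79.46%
```

With balanced classes, accuracy is simply the mean of recall and specificity, which is consistent with the reported baseline numbers.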

Labeled real set: 150 tweets (75 labeled by Xiaoli, 75 labeled by Shan): 5 neg, 87 pos. 150 tweets (75 labeled by Xiaoli, 75 labeled by Shan): 76 neg, 32 pos.

Results on the 2 recent movies (real set):

Regular expression 1: Regular expression 2: (#[A-Za-z0-9]+) | \t])|(\w+:\/\/\S+) (all punctuation removed)

Setting                                    Hugo     Muppets  Together
Regular expression 1, stopwords removed    64.13%   64.81%   64.50%
Regular expression 1, stopwords included   63.04%   54.63%   58.50%
Regular expression 2, stopwords removed    70.65%   62.96%   66.5%
Regular expression 2, stopwords included   65.22%   53.70%   59.00%

Which regular expression should we choose based on this result? Hard to say… :-(

We moved our attention to: LingPipe, Twendz, Twitter Sentiment, tweetfeel.

twittersentiment.appspot.com They are new too.

Our classifier gets exactly the same results as theirs, but wait…

Two tweets made us frown :-(

Emoticons play a role!!! >:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^) >:D :-D :D 8-D 8D x-D xD X-D XD =-D =D =-3 =3 :P FTW :'( ;*( :_( T.T T_T Y.Y Y_Y >:[ :-( :( :-c :c :-< :< :-[ :[ :{ >.>. :\ >:/ :-/ :-. :/ :\ =/ =\ :S

So we chose the regular expression that keeps emoticons, and we built a dictionary to eliminate all the punctuation marks that appear alone: ','_','+','=','{','}','[',']',';',':','"',"'",' ',',','.','?','|','\\','/'
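A tokenizer along these lines might look like the sketch below. It is an illustration under assumptions, not the authors' code: `EMOTICONS` holds only a small subset of the emoticon list from the earlier slide, `LONE_PUNCTUATION` is built from the legible entries of the dictionary above, and the URL pattern reuses the `\w+://\S+` fragment from the regular expression slide.

```python
import re

# Assumption: a small subset of the emoticons listed on the earlier slide.
EMOTICONS = {":-)", ":)", ":D", "=)", ":-(", ":(", ":/", "T_T"}
# Punctuation marks that are dropped when they appear as a token on their own.
LONE_PUNCTUATION = set("_+={}[];:\"',.?|\\/")
URL = re.compile(r"\w+://\S+")


def tokenize(tweet):
    """Split a tweet on whitespace, keeping emoticons and dropping lone punctuation."""
    tweet = URL.sub(" ", tweet)          # strip links before splitting
    tokens = []
    for token in tweet.split():
        if token in EMOTICONS:
            tokens.append(token)         # keep emoticons exactly as written
        elif token in LONE_PUNCTUATION:
            continue                     # a punctuation mark standing alone
        else:
            tokens.append(token.lower())
    return tokens


print(tokenize("Loved Hugo :) , see http://t.co/abc"))
# ['loved', 'hugo', ':)', 'see']
```

This sketch only recognizes emoticons separated from words by whitespace; an emoticon glued to a word (for example "great:)") would need an extra splitting step.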

Finally, the python begins to catch the twittering bird…….. Demo

“Happy” Feet? So all tweets are positive? We still need to do more semi-supervised learning.
1. Handle specific bigrams like “don’t love”
2. Build a finer classifier that can exclude objective tweets
3. Detect and remove annoying movie names like “Happy Feet”
4. Give more weight to dominant words like “excellent”, “worst”
5. Our final task: give ratings

Thank you all! Thank you STAT 4240! Thank you Columbia!