Crowd explicit sentiment analysis A. Montejo-Raez, M.C. Diaz-Galiano, F. Martinez-Santiago, L.A. Urena-Lopez Computer Science Department, University of.

Crowd explicit sentiment analysis
A. Montejo-Raez, M.C. Diaz-Galiano, F. Martinez-Santiago, L.A. Urena-Lopez
Computer Science Department, University of Jaen
Knowledge-Based Systems 69 (2014)
Presenter: 劉憶年, 2015/11/3

Outline
- Introduction
- Sentiment analysis in social media
- Using the crowd to collect affective terms
- Crowd explicit sentiment analysis
- Experiments and results
- Conclusions and further work

Introduction (1/3)
Since the advent of information and communication technologies, the number of information sources, and therefore the amount of data our society records, has increased significantly. We are all aware of this, but we may not realize the impressive numbers behind it. We cannot ignore this explosion, nor how Big Data has emerged as the new major challenge in computation. Our ability to produce unlimited information contrasts with storage limits and computing performance, which, although still advancing continuously, may soon be overtaken by the data uploaded by human and non-human sources.

Introduction (2/3)
This paper is only a small proposal on how to face the new era of overwhelming availability of information. It provides a strategy based on constructing knowledge from that continuous stream of data while keeping only a tiny, filtered part of it. In fact, we do not have to keep all of the data, as we simply have too much of it. New knowledge can be constructed with the help of simple heuristics. This is similar to the approach followed by our brain, which filters out thousands of incoming stimuli every second and generates from the rest what may not be the best understanding of reality, but knowledge that is more than valid for our survival as human beings.

Introduction (3/3)
Following this reasoning, we have designed an approach to polarity classification in sentiment analysis. It can be considered a simple approach, but the results confirm its validity, encouraging us to discover new areas where the idea "Let the crowd express itself" could be applied.

Sentiment analysis in social media (1/3)
Sentiment Analysis (also known as Opinion Mining) is currently one of the most active research areas in Natural Language Processing, with special interest in the classification of texts as positive, negative or neutral. Supervised strategies have reported the best results since the earliest studies and are still the choice for many solutions, from Information Theory based features (with an SVM classifier) to more complex learned rules. Unsupervised approaches have relied mainly on lexicons in which words are associated with polarity scores, although more advanced solutions have been proposed that use intensive lexical analysis or deep learning, the latter being a very promising and innovative way to tackle the problem.

Sentiment analysis in social media (2/3)
The second group of proposals (the unsupervised ones) is mainly based on the creation of a list of affective terms, which is, again, usually closely tied to the domain of the targeted texts and hard to generate for different languages. At any rate, these studies follow the line of concept-level sentiment analysis, as the analyzed texts are represented by a vector of "feelings" rather than by pure term vectors. Most of these knowledge bases do not consider contextual information that may modify the polarity of the collected concepts: some terms may become subjective when accompanying certain words, or even completely change their polarity when used with certain modifiers or within the context of certain phrases.

Sentiment analysis in social media (3/3)
Twitter has been found to be very useful in many scenarios, such as real-time Recommender Systems, cinema revenue prediction and even crime prediction, among others.

Using the crowd to collect affective terms -- The WeFeelFine project (1/3)
Since 2005, the website WeFeelFine has been harvesting from social media millions of sentences containing the expressions "I feel" or "I am feeling", creating a huge database of sentences related to feelings or emotions. Although the main goal of the project is to serve as a monitor of the human state at a global level, we found that the collected data could be useful in sentiment analysis. The current list contains 2178 different feelings, although the 200 most frequent ones account for 70% of a total of almost 2 million sentences. These 200 feelings are the ones considered in this study.
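
As a rough illustration of the harvesting idea above, the following sketch pulls the word after "I feel" or "I am feeling" out of a few sentences and measures how much of the collection the most frequent feelings cover. The sample sentences, the regular expression and the function names are assumptions made for illustration; this is not the WeFeelFine implementation.

```python
import re
from collections import Counter

# Assumed pattern: the single word following "I feel" or "I am feeling".
FEEL_RE = re.compile(r"\bI\s+(?:feel|am\s+feeling)\s+([a-z]+)", re.IGNORECASE)

def extract_feelings(sentences):
    """Return a Counter of the feeling words found in the sentences."""
    counts = Counter()
    for sentence in sentences:
        for match in FEEL_RE.finditer(sentence):
            counts[match.group(1).lower()] += 1
    return counts

def coverage(counts, k):
    """Fraction of all matches accounted for by the k most frequent feelings."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _, n in counts.most_common(k))
    return top / total

# Made-up sample sentences (hypothetical data).
sample = [
    "I feel happy today",
    "Honestly I am feeling tired",
    "i feel happy about it",
    "I feel lost",
]
feelings = extract_feelings(sample)
```

On real data, `coverage(feelings, 200)` would be the quantity the slide reports as 70%.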

Using the crowd to collect affective terms -- The WeFeelFine project (2/3)

Using the crowd to collect affective terms -- The WeFeelFine project (3/3)
WeFeelFine is a very interesting project, and its continuous trawling of data could be a valuable resource for sentiment analysis. Previous studies have already considered it: there, a bag of sentiment words is created from the WeFeelFine lists of feelings and augmented with synonyms and antonyms from Thesaurus.

Using the crowd to collect affective terms -- The MeSientoX corpus (1/2)
As generating a feelings database based on simple regular expressions is not a difficult task, we decided to test this approach for Spanish. Instead of translating the texts from the WeFeelFine database, almost two million Spanish tweets containing the words "me siento" ("I feel") were collected by means of the Twitter API. The tweets were retrieved over 35 days, between December 2012 and January 2013, for a total of 1,863,758 tweets. Our first attempt at polarity classification with this data gave promising results [self-reference removed], as did a retrieval-based solution.

Using the crowd to collect affective terms -- The MeSientoX corpus (2/2)
A unified form merges the two forms derived from gender variants in Spanish, so tweets with the expression "Me siento cansada" or "Me siento cansado" fall under the same feeling, cansado. We also discarded the unified forms that could be considered non-sentiment words, such as words in a language other than Spanish (alone, crazy). This is the only step involving human intervention, though the effort is minimal (the extracted emotions were labeled in less than ten minutes). The 201 most frequent sentiment words were selected, of which 84 were considered positive and 117 negative, from a total number of different unified forms of
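
The merging of gender variants into a unified form can be sketched as follows. The rule used here is a deliberately naive assumption, not the authors' procedure: a form ending in "a" is folded into its "-o" counterpart only when that counterpart was also observed. The counts are made up.

```python
from collections import Counter

def unify_gender(counts):
    """Merge a form ending in 'a' into its '-o' counterpart when both occur.

    Naive assumption: only the a/o alternation is handled, and a merge
    happens only if the '-o' form was also observed in the data.
    """
    unified = Counter()
    for form, n in counts.items():
        masculine = form[:-1] + "o"
        if form.endswith("a") and masculine in counts:
            unified[masculine] += n
        else:
            unified[form] += n
    return unified

# Hypothetical frequencies of words seen after "me siento".
raw = Counter({"cansada": 3, "cansado": 2, "feliz": 4, "sola": 1})
merged = unify_gender(raw)
```

Requiring the "-o" form to be present keeps words such as "sola" untouched when "solo" never appears, at the cost of missing some genuine variants.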

Crowd explicit sentiment analysis (1/4)
Explicit Semantic Analysis proposes using a collection of documents as the index for representing new documents. In our case, the WeFeelFine data is taken as the base for generating English feelings documents, whereas tweets extracted from MeSientoX are used to generate Spanish feelings documents. Each feeling X is represented as a compilation of the retrieved tweets containing the expression "Me siento X", assuming that the term X refers to a feeling. Thus, instead of being projected onto a space of articles, a document is projected onto a space of feelings collected automatically from social media posts. The components of the resulting vector are cosine distances obtained by means of Latent Semantic Analysis.
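
To make the projection onto a space of feelings concrete, here is a minimal sketch that scores a text against each feeling document using cosine similarity over raw term frequencies. It skips the Latent Semantic Analysis step mentioned above, and the feeling documents, tokenization and function names are all assumptions for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term frequencies for a whitespace-tokenized text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def project_onto_feelings(text, feeling_docs):
    """Map a text to a vector of similarities, one entry per feeling."""
    doc_vec = term_vector(text)
    return {feeling: cosine(doc_vec, term_vector(doc))
            for feeling, doc in feeling_docs.items()}

# Made-up feeling documents: each concatenates the posts for one feeling.
feeling_docs = {
    "happy": "i feel happy the sun is out i feel happy with my friends",
    "tired": "i feel tired after work i feel tired and sleepy",
}
vector = project_onto_feelings("sunny day with friends", feeling_docs)
```

The resulting dictionary plays the role of the vector of feelings: one weight per feeling instead of one weight per term.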

Crowd explicit sentiment analysis (2/4)

Crowd explicit sentiment analysis (3/4)
Therefore, a document is preprocessed to obtain its vector, which is then multiplied by a low-rank approximation (obtained by Singular Value Decomposition) of the feeling-to-term matrix created from the corpora generated from micro-blog posts.
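
In standard LSA notation, which is an assumption here since the slide gives no formula, let M be the feeling-to-term matrix (one row per feeling, one column per term), d the term vector of a document, and k the chosen rank. The step described above can then be sketched as:

```latex
M \approx M_k = U_k \Sigma_k V_k^{\top}, \qquad \hat{d} = M_k \, d
```

Here U_k, \Sigma_k and V_k hold the k largest singular values and their singular vectors, and each entry of \hat{d} scores the document against one feeling.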

Crowd explicit sentiment analysis (4/4)
This second way of computing the final polarity value takes into consideration only the order of the feelings, not their actual distance to the target document. As feelings "emerge" from the collected data, and given the randomness of the captured texts, the cosine distance may not add relevant information to the model.
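
The contrast between using the distances and using only the order of the feelings can be sketched as below. The exact formulas are not given in this transcript, so both scoring functions are assumptions: the first weights each feeling's polarity (+1 or -1) by its similarity, the second by the reciprocal of its rank. The labels and similarity values are made up.

```python
def polarity_by_similarity(similarities, labels):
    """Sum of feeling polarities (+1/-1) weighted by similarity to the document."""
    return sum(labels[f] * s for f, s in similarities.items())

def polarity_by_rank(similarities, labels):
    """Sum of feeling polarities weighted by 1/rank, ignoring similarity values."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return sum(labels[f] / rank for rank, f in enumerate(ranked, start=1))

# Hypothetical polarity labels and similarities for one document.
labels = {"happy": 1, "calm": 1, "tired": -1}
similarities = {"happy": 0.40, "calm": 0.10, "tired": 0.35}

score_sim = polarity_by_similarity(similarities, labels)
score_rank = polarity_by_rank(similarities, labels)
```

A positive score would be read as positive polarity and a negative one as negative polarity. In the rank-based variant only the ordering of the similarities matters, so rescaling them leaves the score unchanged.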

Crowd explicit sentiment analysis -- Integrating SenticNet 3
The latest available version (3-beta) has been enhanced by considering further knowledge sources. Although SenticNet 3 considers even fewer concepts than SenticNet 2, the inclusion of common and common-sense knowledge has resulted in a more coherent net of emotional concepts. We found that it could be a straightforward solution for labeling crowd-based emotional concepts (feelings).

Experiments and results (1/4)
For the English experiments, the Emoticon data set from Stanford University was selected. To enable the comparison of results with other approaches, only the test set is considered. It contains 177 negative tweets and 182 positive tweets, manually labeled. For the Spanish experiments, we selected the SFU Review corpus. It is composed of 400 reviews divided into eight categories: cars, hotels, washing machines, books, cell phones, music, computers, and movies. Each category contains 50 positive and 50 negative reviews, defined as positive or negative based on the number of stars given by the reviewer (1–2 = negative; 4–5 = positive; 3-star reviews are not included). These reviews were collected from the Ciao web site.
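
The star-based labeling rule for the reviews can be written as a small function. The function name and the sample reviews are hypothetical; only the mapping itself (1–2 negative, 4–5 positive, 3 excluded) comes from the description above.

```python
def label_from_stars(stars):
    """Map a star rating to a polarity label; 3-star reviews are excluded."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    return None  # 3 stars (or anything else) is not part of the corpus

# Made-up reviews as (text, stars) pairs.
reviews = [("great phone", 5), ("average hotel", 3), ("broke in a week", 1)]
labeled = [(text, label_from_stars(stars)) for text, stars in reviews
           if label_from_stars(stars) is not None]
```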

Experiments and results (2/4)

Experiments and results (3/4)
This finding suggests how complex sentiment representation is: documents are better represented by several emotional states than by pure polarity classes. This agrees with treating the sentiment analysis problem at the concept level. One reason for such behavior may be the big difference in text quality between the two data sets. WeFeelFine texts have good grammar and very few misspellings or jargon terms, and their sentences are longer and richer in expressiveness, whilst the tweets compiled in the MeSientoX corpus are far from well-written Spanish. Also, the cosine distance may not reflect the real contrast between different feelings in the latter case.

Experiments and results (4/4)
Nevertheless, the accuracy obtained is high, considering that no normalization is applied either to the text obtained from WeFeelFine or to the test data. Thus, our approach shows that good performance can be obtained with this straightforward solution, based purely on capturing emotional expressions from blogs and other channels of social communication. In this case, the results of our approach outperform the lexical based solution proposed by these authors.

Conclusions and further work (1/3)
Crowd Explicit Sentiment Analysis (CESA) has been introduced as a stream-based approach for polarity classification. Its simple design allows polarity classifiers to be built for different languages and domains without the need for complex linguistic resources or architectures. Nevertheless, further research is needed, as many issues remain to be explored in order to improve the proposed method. For example, we have found that the expressions used capture large quantities of text without relevant content. Thus, terms and posts have to be selected.

Conclusions and further work (2/3)
Thus, the difference in accuracy may be due to the length of the documents or to language issues. This needs further analysis and study on more comparable corpora. Therefore, a more accurate capturing technique could be useful, although the method intends to avoid overly sophisticated solutions, as its strength lies in its simplicity.

Conclusions and further work (3/3)
Nevertheless, we plan to use these methods while building the models from the vectors of feelings that CESA generates. Social-based representation of emotions has been found to be a valid way to model affective communication. It could outperform the two approaches for final polarity calculation, as we expect to confirm in future experiments. In any case, the solutions that could be built on the idea of "let the crowd express itself" are very suitable for big data environments. We believe that on-line learning algorithms and evolving training data could be the key to modeling the knowledge that emerges from the vast amount of text published every second, everywhere.