
Sentiment Analysis of Social Media Content using N-Gram Graphs
Authors: Fotis Aisopos, George Papadakis, Theodora Varvarigou
Presenter: Konstantinos Tserpes
National Technical University of Athens, Greece

Social Media and Sentiment Analysis
Social Networks enable users to:
– Chat about everyday issues
– Exchange political views
– Evaluate services and products
Estimating the average sentiment on a topic is useful, e.g. for social analysts.
Sentiments are expressed:
– Implicitly (e.g. through emoticons or specific words)
– Explicitly (e.g. the “Like” button on Facebook)
In this work we focus on content-based patterns for detecting sentiments.

Intricacies of Social Media Content
Inherent characteristics render established, language-specific methods inapplicable:
– Sparsity: each Twitter message comprises at most 140 characters
– Multilinguality: many different languages and dialects
– Non-standard vocabulary: informal textual content (i.e., slang) and neologisms (e.g. “gr8” instead of “great”)
– Noise: misspelled words and incorrect use of phrases
Solution: a language-neutral method that is robust to noise.

Focus on Twitter
We selected the Twitter micro-blogging service due to:
– Popularity (200 million users, 1 billion posts per week)
– Strict rules of social interaction (i.e., sentiments are expressed through short, self-contained text messages)
– Data publicly available through a handy API

Polarity Classification Problem
Polarity: the expression of a non-neutral sentiment.
– Polarized tweets: tweets that express either a positive or a negative sentiment (polarity is explicitly denoted by the respective emoticons)
– Neutral tweets: tweets lacking any polarity indicator
Binary Polarity Classification: decide the polarity of a tweet on a binary scale (negative or positive).
General Polarity Classification: decide the polarity of a tweet on a three-value scale (negative, positive or neutral).

Representation Model 1: Term Vector Model
Aggregates the set of distinct words (i.e., tokens) contained in a set of documents. Each tweet t_i is then represented as a vector v_ti = (v_1, v_2, ..., v_j), where v_j is the TF-IDF value of the j-th term. The same model applies to polarity classes.
Drawbacks:
– It requires language-specific techniques to correctly identify semantically equivalent tokens (e.g., stemming, lemmatization, PoS tagging)
– High dimensionality
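As an illustration (not the authors' code), a minimal sketch of the term vector model using scikit-learn; the example tweets are hypothetical:

```python
# Minimal sketch of the term vector model: each tweet becomes a
# TF-IDF-weighted vector over the aggregated corpus vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [                      # hypothetical example tweets
    "great phone, love it :)",
    "my phone died again :(",
    "just got home",
]

vectorizer = TfidfVectorizer(lowercase=True)  # one dimension per distinct token
vectors = vectorizer.fit_transform(tweets)    # sparse matrix: tweets x vocabulary

print(vectorizer.get_feature_names_out())     # the aggregated token set
print(vectors.shape)                          # dimensionality grows with the vocabulary
```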

Representation Model 2: Character n-grams
Each document and polarity class is represented as the set of substrings of length n of the original text.
– for n = 2: bigrams, n = 3: trigrams, n = 4: four-grams
– example: “home phone” consists of the following trigrams (with “_” denoting the space): {hom, ome, me_, e_p, _ph, pho, hon, one}
Advantages:
– language-independent method
Disadvantages:
– high dimensionality
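Character n-gram extraction is a one-liner; a quick sketch of our own for the example above:

```python
# Minimal sketch: extract the set of character n-grams of a text.
def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of all substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

print(char_ngrams("home phone", 3))
# {'hom', 'ome', 'me ', 'e p', ' ph', 'pho', 'hon', 'one'} (in some order)
```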

Representation Model 3: Character n-gram graphs
Each document and polarity class is represented as a graph, where:
– the nodes correspond to character n-grams,
– the undirected edges connect neighboring n-grams (i.e., n-grams that co-occur in at least one window of n characters), and
– the weight of an edge denotes the co-occurrence rate of the adjacent n-grams.
Typical value space for n: n = 2 (bigram graphs), n = 3 (trigram graphs), and n = 4 (four-gram graphs).
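A rough sketch of how such a graph could be built (our illustration, not the authors' implementation; the window here is measured in n-gram positions rather than characters, a simplification of the model above):

```python
# Minimal sketch: build a character n-gram graph. Nodes are n-grams;
# an undirected edge links two n-grams that co-occur within the window,
# weighted by their co-occurrence count.
from collections import defaultdict

def ngram_graph(text: str, n: int = 3, window: int = 3) -> dict:
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = defaultdict(int)  # {frozenset({g1, g2}): weight}
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            if g != grams[j]:
                edges[frozenset((g, grams[j]))] += 1
    return edges

graph = ngram_graph("home_phone", n=3)
for edge, weight in graph.items():
    print(sorted(edge), weight)
```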

Example of n-gram graphs
[Figure: the trigram graph of the phrase “home_phone”; nodes are its trigrams, weighted edges connect co-occurring trigrams.]

Features of the n-gram graph model
To capture textual patterns, n-gram graphs rely on the following graph similarity metrics (computed between the polarity class graphs and the tweet graphs):
– Containment Similarity (CS): portion of common edges, regardless of their weights
– Size Similarity (SS): ratio of the sizes of the two graphs
– Value Similarity (VS): portion of common edges, taking their weights into account
– Normalized Value Similarity (NVS): value similarity without the effect of the relative graph size (i.e., NVS = VS/SS)
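A hedged sketch of the four metrics over the {edge: weight} dictionaries produced by ngram_graph above; this is our reading of the definitions, and the paper's exact normalizations may differ:

```python
# Sketch of CS, SS, VS and NVS between two n-gram graphs
# represented as {edge: weight} dicts (see ngram_graph above).
def similarities(g1: dict, g2: dict) -> dict:
    common = g1.keys() & g2.keys()
    cs = len(common) / min(len(g1), len(g2))            # shared edges, weights ignored
    ss = min(len(g1), len(g2)) / max(len(g1), len(g2))  # ratio of graph sizes
    # shared edges, each contributing the ratio of its two weights
    vs = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common) \
         / max(len(g1), len(g2))
    return {"CS": cs, "SS": ss, "VS": vs, "NVS": vs / ss}
```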

Feature Extraction
Create G_pos and G_neg (and G_neu) by aggregating half of the training tweets of the respective polarity.
For each tweet of the remaining training set:
– create the tweet's n-gram graph G_ti
– derive a feature “vector” by comparing it with the class graphs
The same procedure is applied to the testing tweets.
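Putting the pieces together, a hypothetical end-to-end step for one tweet (function names come from our sketches above; G_pos and G_neg are assumed to be precomputed by merging the class tweets' graphs):

```python
# Sketch: turn a tweet into the similarity features used for
# binary classification, given precomputed class graphs.
def tweet_features(tweet: str, G_pos: dict, G_neg: dict) -> dict:
    g = ngram_graph(tweet, n=3)
    feats = {}
    for cls, class_graph in (("pos", G_pos), ("neg", G_neg)):
        sims = similarities(g, class_graph)
        for m in ("CS", "VS", "NVS"):
            feats[f"{m}_{cls}"] = sims[m]   # e.g. "CS_pos", "VS_neg", ...
    return feats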

Discretized Graph Similarities
Discretized similarity values offer higher classification efficiency. Each nominal feature dsim(s_a, s_b) discretizes a pair of similarity values computed against two polarity classes.
Binary classification has three nominal features:
– dsim(CS_neg, CS_pos)
– dsim(NVS_neg, NVS_pos)
– dsim(VS_neg, VS_pos)
General classification has six more nominal features:
– dsim(CS_neg, CS_neu)
– dsim(NVS_neg, NVS_neu)
– dsim(VS_neg, VS_neu)
– dsim(CS_neu, CS_pos)
– dsim(NVS_neu, NVS_pos)
– dsim(VS_neu, VS_pos)
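The exact discretization function did not survive this transcript; a plausible stand-in (purely our assumption) returns a nominal label indicating which class's similarity dominates:

```python
# Assumed discretization (the original dsim definition is not
# reproduced here): map a pair of similarity values to a nominal label.
def dsim(s_a: float, s_b: float) -> str:
    if s_a > s_b:
        return "first"
    if s_a < s_b:
        return "second"
    return "equal"
```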

Data set
Initial dataset:
– 475 million real tweets, posted by 17 million users
– polarized tweets: 6.12 million negative, … million positive
Data set for Binary Polarity Classification: random selection of 1 million tweets from each polarity category.
Data set for General Polarity Classification: the above + a random selection of 1 million neutral tweets.

Experimental Setup
– 10-fold cross-validation
– Classification algorithms (default configuration of Weka): Naive Bayes Multinomial (NBM) and the C4.5 decision tree classifier
– Effectiveness metric: classification accuracy (correctly_classified_documents / all_documents)
– Frequency threshold for the term vector and n-grams models: only features that appear in at least 1% of all documents were considered
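For readers outside Weka, an equivalent protocol can be sketched in scikit-learn, with MultinomialNB standing in for NBM and DecisionTreeClassifier approximating C4.5; the feature matrix here is random placeholder data, not the paper's:

```python
# Sketch of the evaluation protocol: 10-fold cross-validation,
# scored by classification accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 9))   # placeholder non-negative features
y = rng.integers(0, 2, size=200)        # placeholder binary polarity labels

for clf in (MultinomialNB(), DecisionTreeClassifier()):
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(type(clf).__name__, scores.mean())
```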

Evaluation Results
– n-grams outperform the term vector model for n = 3 and n = 4 in all cases (language-neutral, noise-tolerant)
– n-gram graphs: low accuracy with NBM, higher values overall with C4.5
– incrementing n by 1 increases performance by 3%-4%

Efficiency Performance Analysis
– n-grams involve by far the largest set of features → high computational load
– four-grams: fewer features than trigrams (most four-grams are too rare to pass the frequency threshold)
– n-gram graphs: significantly fewer features in all cases → much higher classification efficiency!

Improvements (work under submission)
– We lowered the frequency threshold to 0.1% for tokens and n-grams, to increase the performance of the term vector and n-grams models (at the cost of even lower efficiency).
– We included in the training stage the tweets that were used for building the polarity class graphs.
Outcomes:
– Higher performance for all methods
– N-gram graphs again outperform all other models
– Accuracy reaches significantly higher values (>95%)

Thank you!
SocIoS project: