Published by Barbara Nichols. Modified over 9 years ago.
1
Sentiment Analysis of Social Media Content using N-Gram Graphs
Authors: Fotis Aisopos, George Papadakis, Theodora Varvarigou
Presenter: Konstantinos Tserpes
National Technical University of Athens, Greece
2
Social Media and Sentiment Analysis
Social networks enable users to:
– Chat about everyday issues
– Exchange political views
– Evaluate services and products
Estimating the average sentiment for a topic is useful (e.g. to social analysts).
Sentiments are expressed:
– Implicitly (e.g. through emoticons or specific words)
– Explicitly (e.g. the “Like” button in Facebook)
In this work we focus on content-based patterns for detecting sentiments.
30/11/2011, International ACM Workshop on Social Media (WSM11)
3
Intricacies of Social Media Content
Inherent characteristics that render established, language-specific methods inapplicable:
– Sparsity: each Twitter message comprises at most 140 characters
– Multilinguality: many different languages and dialects
– Non-standard vocabulary: informal textual content (e.g. slang) and neologisms (e.g. “gr8” instead of “great”)
– Noise: misspelled words and incorrect use of phrases
Solution: a language-neutral method that is robust to noise.
4
Focus on Twitter
We selected the Twitter micro-blogging service due to its:
– Popularity (200 million users, 1 billion posts per week)
– Strict rules of social interaction (i.e., sentiments are expressed through short, self-contained text messages)
– Data that is publicly available through a handy API
5
Polarity Classification Problem
Polarity: the expression of a non-neutral sentiment.
– Polarized tweets: tweets that express either a positive or a negative sentiment (polarity is explicitly denoted by the respective emoticons)
– Neutral tweets: tweets lacking any polarity indicator
Binary Polarity Classification: decide the polarity of a tweet on a two-class scale (i.e., negative or positive).
General Polarity Classification: decide the polarity of a tweet on a three-class scale (i.e., negative, positive or neutral).
6
Representation Model 1: Term Vector Model
Aggregates the set of distinct words (i.e., tokens) contained in a set of documents.
Each tweet $t_i$ is then represented as a vector $v_{t_i} = (v_1, v_2, \dots, v_j)$, where $v_j$ is the TF-IDF value of the $j$-th term.
The same model applies to polarity classes.
Drawbacks:
– It requires language-specific techniques that correctly identify semantically equivalent tokens (e.g., stemming, lemmatization, P-o-S tagging).
– High dimensionality
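The term vector representation can be illustrated with a minimal sketch. The slide does not say which TF-IDF variant was used, so the code assumes whitespace tokenization, relative term frequency, and idf = log(N / df); the function name `tfidf_vectors` and the sample tweets are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one TF-IDF vector per document).

    Assumptions (not from the slides): lowercase whitespace tokens,
    tf = count / document length, idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([
            (counts[term] / len(toks)) * math.log(n_docs / df[term]) if toks else 0.0
            for term in vocab
        ])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["good day :)", "bad day :("])
```

Every distinct token adds one dimension, which is where the high dimensionality comes from; a token shared by all documents (here "day") gets weight 0 under this idf.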
7
Representation Model 2: Character n-grams
Each document and polarity class is represented as the set of substrings of length n of the original text.
For n = 2: bigrams, n = 3: trigrams, n = 4: four-grams.
Example: “home phone” consists of the following trigrams: {hom, ome, me_, e_p, _ph, pho, hon, one}, where “_” stands for the space character.
Advantages: language-independent method.
Disadvantages: high dimensionality.
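Extracting character n-grams amounts to sliding a window of n characters over the text. A minimal sketch follows; the helper name `char_ngrams` is hypothetical.

```python
def char_ngrams(text, n):
    """Return every substring of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


trigrams = char_ngrams("home phone", 3)
# trigrams == ['hom', 'ome', 'me ', 'e p', ' ph', 'pho', 'hon', 'one']
distinct_trigrams = set(trigrams)
```

No tokenization, stemming or dictionary is involved, which is why the model is language-independent; whitespace is treated like any other character.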
8
Representation Model 3: Character n-gram graphs
Each document and polarity class is represented as a graph, where:
– the nodes correspond to character n-grams,
– the undirected edges connect neighboring n-grams (i.e., n-grams that co-occur in at least one window of n characters),
– the weight of an edge denotes the co-occurrence rate of the adjacent n-grams.
Typical values for n: n = 2 (bigram graphs), n = 3 (trigram graphs), and n = 4 (four-gram graphs).
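A rough sketch of building such a graph is given below. It is an illustration under assumptions rather than the authors' implementation: the window is measured over the sequence of n-grams and defaults to n, the edge weight is a raw co-occurrence count, and a graph is stored as a mapping from a sorted n-gram pair to its weight. The name `ngram_graph` is hypothetical.

```python
from collections import Counter
from typing import Optional


def ngram_graph(text: str, n: int = 3, window: Optional[int] = None) -> Counter:
    """Map each undirected edge (sorted pair of n-grams) to its weight.

    Assumptions: each n-gram is linked to the `window` n-grams that
    follow it (window defaults to n); weight = number of co-occurrences.
    """
    window = n if window is None else window
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges: Counter = Counter()
    for i, gram in enumerate(grams):
        for other in grams[i + 1:i + 1 + window]:
            # Sorting the pair makes the edge undirected.
            edges[tuple(sorted((gram, other)))] += 1
    return edges


graph = ngram_graph("home_phone", n=3)
```

Unlike the plain n-gram model, the graph also records which n-grams appear near each other, so some of the local ordering of the text is preserved.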
9
Example of n-gram graphs
[Figure: n-gram graph representation of the phrase “home_phone”]
10
Features of the n-gram graphs model
To capture textual patterns, n-gram graphs rely on the following graph similarity metrics (computed between the polarity class graphs and the tweet graphs):
– Containment Similarity (CS): fraction of common edges, regardless of their weights
– Size Similarity (SS): ratio of the sizes of the two graphs
– Value Similarity (VS): fraction of common edges, taking into account their weights
– Normalized Value Similarity (NVS): value similarity without the effect of the relative graph size (i.e., NVS = VS/SS)
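The four metrics can be sketched as follows. The slide gives only verbal definitions, so the normalizations here (minimum graph size for CS, maximum graph size for VS, min/max weight ratio per common edge) are assumptions taken from one common formulation of n-gram graph similarity; graph size is taken to be the number of edges, weights are assumed positive, and the toy graphs are made up.

```python
def containment_similarity(g1, g2):
    """Share of common edges, ignoring weights (assumed: / smaller graph)."""
    if not g1 or not g2:
        return 0.0
    return sum(1 for edge in g1 if edge in g2) / min(len(g1), len(g2))


def size_similarity(g1, g2):
    """Ratio of the smaller edge count to the larger one."""
    if not g1 or not g2:
        return 0.0
    return min(len(g1), len(g2)) / max(len(g1), len(g2))


def value_similarity(g1, g2):
    """Common edges weighted by how close their weights are."""
    if not g1 or not g2:
        return 0.0
    total = sum(min(w, g2[e]) / max(w, g2[e]) for e, w in g1.items() if e in g2)
    return total / max(len(g1), len(g2))


def normalized_value_similarity(g1, g2):
    """NVS = VS / SS, as stated on the slide."""
    ss = size_similarity(g1, g2)
    return value_similarity(g1, g2) / ss if ss else 0.0


# Toy graphs: edge -> weight (illustrative values only).
tweet_graph = {("a", "b"): 1.0, ("b", "c"): 2.0}
class_graph = {("a", "b"): 2.0, ("b", "c"): 2.0, ("c", "d"): 1.0, ("d", "e"): 1.0}

cs = containment_similarity(tweet_graph, class_graph)
ss = size_similarity(tweet_graph, class_graph)
vs = value_similarity(tweet_graph, class_graph)
nvs = normalized_value_similarity(tweet_graph, class_graph)
```

Dividing VS by SS removes the penalty a short tweet would otherwise incur simply for being much smaller than the class graph.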
11
Feature Extraction
Create $G_{pos}$, $G_{neg}$ (and $G_{neu}$) by aggregating half of the training tweets with the respective polarity.
For each tweet of the remaining training set:
– create the tweet n-gram graph $G_{t_i}$
– derive a feature “vector” from the graph comparisons
The same procedure applies to the testing tweets.
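The split-and-compare procedure might look roughly like the sketch below. Several details are assumptions: class graphs are merged by averaging edge weights over their tweets, only containment similarity is computed for brevity (the other metrics would be added in the same way), and all function names and sample tweets are hypothetical.

```python
from collections import Counter


def ngram_graph(text, n=3):
    """Edge (sorted n-gram pair) -> co-occurrence count, window of n."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for other in grams[i + 1:i + 1 + n]:
            edges[tuple(sorted((gram, other)))] += 1
    return edges


def build_class_graph(tweets, n=3):
    """Assumption: merge tweet graphs by averaging edge weights."""
    merged = Counter()
    for tweet in tweets:
        merged.update(ngram_graph(tweet, n))
    return {edge: weight / len(tweets) for edge, weight in merged.items()}


def containment(g1, g2):
    if not g1 or not g2:
        return 0.0
    return sum(1 for edge in g1 if edge in g2) / min(len(g1), len(g2))


def extract_features(training, n=3):
    """training maps a polarity label to its list of tweets."""
    class_graphs, held_out = {}, []
    for label, tweets in training.items():
        half = len(tweets) // 2
        # First half builds the class graph, second half yields instances.
        class_graphs[label] = build_class_graph(tweets[:half], n)
        held_out += [(tweet, label) for tweet in tweets[half:]]
    rows = []
    for tweet, label in held_out:
        graph = ngram_graph(tweet, n)
        features = {f"CS_{c}": containment(graph, cg) for c, cg in class_graphs.items()}
        rows.append((features, label))
    return rows


training = {
    "pos": ["i love this :)", "love it so much", "love this phone", "so much love"],
    "neg": ["i hate this :(", "hate it so much", "hate this phone", "so much hate"],
}
rows = extract_features(training)
```

Each row holds one similarity per class graph instead of one value per token or n-gram, which keeps the number of features small.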
12
Discretized Graph Similarities
Discretized similarity values offer higher classification efficiency. They are created according to the function dsim.
[Equation: definition of dsim, not preserved in this transcript]
Binary classification has three nominal features:
– $dsim(CS_{neg}, CS_{pos})$
– $dsim(NVS_{neg}, NVS_{pos})$
– $dsim(VS_{neg}, VS_{pos})$
General classification has six more nominal features:
– $dsim(CS_{neg}, CS_{neu})$
– $dsim(NVS_{neg}, NVS_{neu})$
– $dsim(VS_{neg}, VS_{neu})$
– $dsim(CS_{neu}, CS_{pos})$
– $dsim(NVS_{neu}, NVS_{pos})$
– $dsim(VS_{neu}, VS_{pos})$
13
Data set
Initial data set:
– 475 million real tweets, posted by 17 million users
– polarized tweets: 6.12 million negative and 14.12 million positive
Data set for Binary Polarity Classification: random selection of 1 million tweets from each polarity category.
Data set for General Polarity Classification: the above plus a random selection of 1 million neutral tweets.
14
Experimental Setup
10-fold cross-validation.
Classification algorithms (default configuration of Weka):
– Naive Bayes Multinomial (NBM)
– C4.5 decision tree classifier
Effectiveness metric: classification accuracy (correctly_classified_documents / all_documents).
Frequency threshold for the term vector and n-grams models: only features that appear in at least 1% of all documents were considered.
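The frequency threshold and the accuracy metric can be sketched in a few lines. This is a toy illustration, not the Weka setup used in the experiments; the function names and the sample data are made up.

```python
from collections import Counter


def frequent_features(docs_features, min_fraction=0.01):
    """Keep features that occur in at least min_fraction of the documents."""
    # Count each feature once per document (document frequency).
    df = Counter(f for feats in docs_features for f in set(feats))
    cutoff = min_fraction * len(docs_features)
    return {feature for feature, count in df.items() if count >= cutoff}


def accuracy(predicted, actual):
    """correctly_classified_documents / all_documents."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# 200 toy documents: "lov" is everywhere, "gr8" in 3 of them, "xyz" in 1.
docs = [["lov"] for _ in range(200)]
for doc in docs[:3]:
    doc.append("gr8")
docs[0].append("xyz")

kept = frequent_features(docs)  # "xyz" (0.5% of documents) is dropped
acc = accuracy(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"])
```

Lowering `min_fraction` keeps more features, trading classification efficiency for coverage.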
15
Evaluation results
Character n-grams outperform the term vector model for n = 3 and n = 4 in all cases (they are language-neutral and noise-tolerant).
n-gram graphs:
– low accuracy with NBM, higher accuracy overall with C4.5
– each increment of n by 1 raises performance by 3%-4%
16
Efficiency Performance Analysis
– n-grams involve by far the largest set of features -> high computational load
– four-grams: fewer features than trigrams (their numerous substrings are rather rare)
– n-gram graphs: significantly lower number of features in all cases -> much higher classification efficiency!
17
Improvements (work under submission)
We lowered the frequency threshold to 0.1% for tokens and n-grams, to increase the performance of the term vector and n-grams models (at the cost of even lower efficiency).
We included in the training stage the tweets that were used for building the polarity classes.
Outcomes:
– Higher performance for all methods.
– N-gram graphs again outperform all other models.
– Accuracy reaches significantly higher values (>95%).
18
Thank you!
SocIoS project: www.sociosproject.eu