Sentiment Analysis in Turkish Media
Cumali Türkmenoğlu, Ahmet Cüneyd Tantuğ
ICML WISDOM 2014 - June 25 2014

Table of Contents
- Introduction
- Datasets
- Methods
- Evaluation
- Conclusion

Introduction

Sentiment Analysis
Sentiment analysis attempts to identify (classify) the sentiment that a person holds towards an object or topic in a text.
Two approaches:
- Lexicon based sentiment classification
- Machine learning based sentiment classification
Example: movie review
TR: ‘Acemi bir yönetmen ve vasat bir film. Tavsiye etmem.’
EN: ‘A novice director and a mediocre movie. I would not recommend it.’
* Most of the sentiment information is carried by opinion words and other important words (negation etc.), which are highlighted in color on the original slide.

About Turkish Language
Turkish is an agglutinating language in which many suffixes can be added to the roots of words. These derivational and inflectional suffixes can change the POS tag and the meaning of the word.

Complex Morphology: from one root, more than 25k surface forms.
Surface              English            Root
araba                car                araba
araba-lar            cars               araba
araba-lar-ı          their cars         araba
araba-lar-ın         your cars          araba
araba-lar-ın-dan     from your cars     araba

Morphological Ambiguity: from one surface form, more than one root.
Surface              English            Root
elmasını             your apple         elma
elmasını             your diamond       elmas

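The morphological ambiguity above can be illustrated with a minimal sketch: given a root lexicon, more than one root may be a prefix of the same surface form. The lexicon `ROOTS` and the helper `candidate_roots` are hypothetical names introduced only for illustration. A real morphological analyzer also checks that the remaining suffixes are valid, which this prefix test does not.

```python
# Toy root lexicon (hypothetical). A real analyzer uses a full lexicon
# plus suffix rules, which this sketch deliberately omits.
ROOTS = {"elma": "apple", "elmas": "diamond", "araba": "car"}


def candidate_roots(surface, roots=ROOTS):
    """Return every known root that the surface form starts with."""
    return sorted(r for r in roots if surface.startswith(r))


if __name__ == "__main__":
    for word in ("elmasını", "arabalarından"):
        print(word, "->", candidate_roots(word))
```

For `elmasını` the sketch returns both `elma` and `elmas`, which is why a disambiguation step is needed before sentiment scoring.
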
Motivation
- To evaluate the performance of lexicon based sentiment analysis (SA) versus machine learning based SA on different types of Turkish text with varying characteristics:
  - Twitter data (short, informal)
  - Movie review data (relatively long, formal)
- To explore new features for lexicon based SA, such as multi-word expressions (MWEs) and absence/presence suffixes.

Datasets

                Twitter Dataset         Movie Reviews
Size            2980 tweets             20244 reviews
Positive        1677                    13224
Negative        1301                    7020
Length          Shorter                 Relatively longer
                14 words per tweet      38 words per review
Noise           Noisy                   Relatively less noisy
Topics          6 different topics      1 topic: movies
Labeling        Manually labeled        Automatically labeled

Methods

Sentiment Classification System Overview
Pipeline: Tweets / movie reviews -> Preprocessing -> Sentiment classification (lexicon based SA or machine learning based SA) -> Evaluation
- A number of preprocessing steps are required because of the productive morphology of Turkish.
- Both methods are evaluated on the Twitter and movie review datasets.

Preprocessing Steps
Turkish requires several important preprocessing steps because of its agglutinative structure.
Pipeline: Tweets / movie reviews -> Deasciifying -> Morphological Analysis -> Morphological Disambiguation -> Multi-Word Extraction -> Sentiment Classification (lexicon or ML based) -> Positive or Negative

Example
Input tweet: Galatasaray son macini kazanamadi, maglup oldu ama umutsuz degiliz. Sevgimiz büyük, Sampiyon cimbom :)
After deasciifying: Galatasaray son maçını kazanamadı, mağlup oldu ama umutsuz değiliz. Sevgimiz büyük, Şampiyon cimbom :)
Morphological analysis and disambiguation:
  ...
  son           son+Adj
  son           son+Noun+A3sg+Pnon+Nom ✓
  ...
  kazanamadı    kazan+Verb^DB+Verb+Able+Neg+Past+A3sg
  mağlup        mağlup+Adj
  oldu          ol+Verb+Pos+Past+A3sg
Output of preprocessing (eylem is Turkish for verb):
  “galatasaray son maç kazan+eylem mağlup_ol+eylem ama umutsuz değil sevgi büyük şampiyon cimbom >]”

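As a rough illustration of how the analyzer output above can feed the later steps, the sketch below splits an analysis string into a root and its tags and records whether a negation tag is present. The function names `parse_analysis` and `to_token` and the returned dictionary layout are assumptions made for this example, not the authors' implementation.

```python
def parse_analysis(analysis):
    """Split e.g. 'kazan+Verb^DB+Verb+Able+Neg+Past+A3sg' into (root, tags)."""
    # '^DB' marks a derivation boundary; drop it so '+' is the only separator.
    parts = analysis.replace("^DB", "").split("+")
    return parts[0], parts[1:]


def to_token(analysis):
    """Build a simplified token: root, first POS tag, and a negation flag."""
    root, tags = parse_analysis(analysis)
    pos = tags[0] if tags else None
    return {"root": root, "pos": pos, "negated": "Neg" in tags}


if __name__ == "__main__":
    for a in ("kazan+Verb^DB+Verb+Able+Neg+Past+A3sg",
              "mağlup+Adj",
              "ol+Verb+Pos+Past+A3sg"):
        print(to_token(a))
```

Keeping the negation flag next to the root lets a later scoring step invert the polarity of `kazan` without re-reading the surface form.
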
Lexicon Based Sentiment Classification
Pipeline: Tweets / movie reviews -> preprocessing -> lexicon based SA (negation handling, booster words, absence/presence suffixes) -> polarity detection -> Positive or Negative

The lexicon contains:
- 2127 Neg(-) terms
- 1530 Pos(+) terms
- 700 MWEs
- 650 words with absence/presence suffixes

Absence/presence suffixes: the derivational suffixes +sız/+siz (without) and +lı/+li (with) in Turkish.
  onur -> honor
  onurlu -> with honor
  onursuz -> without honor

Booster words: a list of words that have a boosting effect when they occur before an adjective.
  Çok güzel -> very beautiful
  En iyisi -> the best one

Negation handling:
- Negation words are ‘‘değil’’ and ‘‘yok’’.
  Güzel -> beautiful
  Güzel değil -> He/She is not beautiful.
- Negation suffixes are ‘‘+ma’’ and ‘‘+me’’.
  Sev(mek) -> (to) love
  Sevmedi -> He/She did not love.

Calculating sentiment polarity:
  ‘‘galatasaray son maç kazan+verb[2][Neg] değil mağlup_ol+verb[-2] ama umutsuz[-3] [Neg] değil sevgi[3] büyük şampiyon[2] cimbom’’
  Pos: +10, Neg: -4
  Score = Pos + Neg = +10 - 4 = +6
  +6 > 0, so Class = Positive

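The scoring rule (sum the positive and negative term scores, then compare with zero) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' system: the lexicon scores, the `[Neg]` token marker, the rule that negation flips the sign, the booster factor of 2, and the handling of a zero score are all hypothetical, since the slides do not specify them. It does not attempt to reproduce the +10 / -4 totals of the example above.

```python
LEXICON = {"güzel": 2, "iyi": 2, "sev": 3, "vasat": -2}  # hypothetical scores
NEGATION_WORDS = {"değil", "yok"}  # negation words named on the slide
BOOSTERS = {"çok", "en"}           # taken from the slide's examples
BOOST_FACTOR = 2                   # assumption: the slides give no value
NEG_TAG = "[Neg]"                  # hypothetical marker for a negation suffix


def score(tokens):
    """Return (pos, neg, label) for a list of lemmatized tokens."""
    pos = neg = 0
    for i, tok in enumerate(tokens):
        negated = tok.endswith(NEG_TAG)
        lemma = tok[:-len(NEG_TAG)] if negated else tok
        if lemma not in LEXICON:
            continue
        value = LEXICON[lemma]
        # A booster directly before the term strengthens its score.
        if i > 0 and tokens[i - 1] in BOOSTERS:
            value *= BOOST_FACTOR
        # A negation word directly after the term also negates it.
        if i + 1 < len(tokens) and tokens[i + 1] in NEGATION_WORDS:
            negated = True
        if negated:
            value = -value  # assumption: negation flips the sign
        if value > 0:
            pos += value
        else:
            neg += value
    # Assumption: a total of zero is labeled Negative.
    label = "Positive" if pos + neg > 0 else "Negative"
    return pos, neg, label


if __name__ == "__main__":
    print(score(["çok", "güzel"]))
    print(score(["güzel", "değil"]))
    print(score(["sev[Neg]"]))
```
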
ML Based Sentiment Classification
Pipeline: Tweets / movie reviews -> preprocessing -> ML based SA (text classification) -> Positive or Negative
Features:
- Bag-of-words representation
- Unigrams and bigrams
- POS tags
Input variants: preprocessing steps ON, or surface forms
Classifiers: SVM, Naive Bayes (NB), Decision Trees (10-fold cross-validation)

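A minimal sketch of two ingredients named above: a bag-of-words representation over unigrams and bigrams, and the index split behind 10-fold cross-validation. The function names are hypothetical, and the split is deliberately simple (no shuffling or stratification). The slides do not say how the folds were built, so this shows the general technique rather than the experimental setup.

```python
from collections import Counter


def ngram_features(tokens):
    """Bag-of-words counts over unigrams and bigrams of one document."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return Counter(list(tokens) + bigrams)


def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists; each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


if __name__ == "__main__":
    print(ngram_features(["vasat", "bir", "film"]))
    for train, test in k_fold_indices(20, k=10):
        print(len(train), test)
```

Each fold serves once as the test set while a classifier is trained on the remaining nine, and the reported accuracy is the average over the ten runs.
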
Evaluation

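Evaluation of Lexicon Based Method (Acc %)
Module                                      Twitter Dataset    Movie Dataset
No deasciification                          73.8               74.5
No disambiguation                           77.0 (only one value survives; column unclear)
No negation handling                        72.4               76.5
No booster                                  74.7 (only one value survives; column unclear)
No MWEs extraction                          78.0 (only one value survives; column unclear)
No absence/presence suffix handling         73.7 (only one value survives; column unclear)
All modules on                              75.2               79.0
Only lexicon (all linguistic modules off)   68.0               71.0
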
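The contribution of a module can be read off as the accuracy lost when it is switched off. The short calculation below does this only for the rows of the table that report a value for both datasets; the variable and function names are hypothetical.

```python
# Accuracies (%) copied from the table rows that list both datasets,
# given as (Twitter, Movie).
ALL_ON = (75.2, 79.0)
ABLATIONS = {
    "No deasciification": (73.8, 74.5),
    "No negation handling": (72.4, 76.5),
    "Only lexicon": (68.0, 71.0),
}


def drop_from_full(acc, full=ALL_ON):
    """Accuracy points lost on (Twitter, Movie) relative to all modules on."""
    return tuple(round(f - a, 1) for f, a in zip(full, acc))


if __name__ == "__main__":
    for name, acc in ABLATIONS.items():
        print(name, drop_from_full(acc))
```

Taken together, the linguistic modules add 7.2 points on the Twitter dataset and 8.0 points on the movie dataset over the bare lexicon.
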
Evaluation of ML Based Method
                                        Twitter Dataset          Movie Dataset
Features                                SVM %   NB %   J48 %     SVM %   NB %   J48 %
TF-IDF (Unigrams)                       84.6    83.7   81.0      88.2    87.0   80.0
TF-IDF (Unigrams) - Surface             83.8    82.5   80.4      88.6    88.7   81.9
TF-IDF (Unigram + Bigram)               85.0, 84.3, 79.0, 89.5, 83.0 (five values survive; column assignment unclear)
TF-IDF (Unigram + Bigram) - Surface     82.3, 77.4, 89.0, 82.4 (four values survive; column assignment unclear)

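For readers unfamiliar with the TF-IDF weighting named in the table, here is a minimal sketch using the textbook form tf * log(N / df). The slides do not state which TF-IDF variant or toolkit was used, so the formula and the function name `tf_idf` are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weighted


if __name__ == "__main__":
    docs = [["iyi", "film"], ["kötü", "film"]]
    for weights in tf_idf(docs):
        print(weights)
```

A term that appears in every document, such as `film` here, gets weight zero, while terms that separate the documents keep a positive weight.
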
Conclusion

- The ML based method performs better than the lexicon based method on both short (Twitter dataset) and long (movie dataset) informal texts.
- Accuracy on the movie dataset is higher than accuracy on the Twitter dataset for both the lexicon based and the ML based sentiment analysis methods.
- MWE extraction and handling of absence/presence suffixes bring a reasonable improvement to the performance of lexicon based SA. This suggests that discovering such hidden information is promising.
- Concept-based sentiment analysis with dependency parsing therefore looks promising and could be future work for us.

Why Errors?
- Opinions in idiomatic expressions and verb phrases
- Spelling mistakes
- Irony and sarcasm
- Dependency on wrong topic/entity

Thanks…

Extra Slides
Multi-Words        Literal meaning     Meaning in English          Sentiment score
Kafayı ye(mek)     Eat the head        To get mentally deranged    +2
Adam ol(mak)       Be man              Be a good man
Kafayı çek(mek)    To pull the head    Consume alcohol
Güzel ol(mak)      Be beautiful        They had not be loved       -2
* (-mek, -mak) are infinitive suffixes in Turkish.

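Multi-word extraction, which turns `mağlup ol` into the single token `mağlup_ol` in the preprocessing example, can be sketched as a lookup over adjacent lemma pairs. The MWE set below holds only a few entries taken from the slides, the function name `merge_mwes` is hypothetical, and restricting the match to two-word expressions is an assumption of this sketch.

```python
# A few two-word MWEs from the slides, stored as lemma pairs.
MWES = {("kafayı", "ye"), ("adam", "ol"), ("kafayı", "çek"), ("mağlup", "ol")}


def merge_mwes(lemmas, mwes=MWES):
    """Join adjacent lemma pairs found in the MWE list with an underscore."""
    merged, i = [], 0
    while i < len(lemmas):
        pair = tuple(lemmas[i:i + 2])
        if pair in mwes:
            merged.append("_".join(pair))
            i += 2  # skip both words of the matched expression
        else:
            merged.append(lemmas[i])
            i += 1
    return merged


if __name__ == "__main__":
    print(merge_mwes(["son", "maç", "kazan", "mağlup", "ol"]))
```

Merging before scoring lets the lexicon assign one sentiment score to the whole expression instead of scoring its words separately.
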
Extra Slides
It is clear from the table that the Twitter dataset is the noisiest and most informal dataset.