Hierarchical emotion classification and emotion component analysis on chinese micro-blog posts Hua Xu 1, Weiwei Yang 1, Jiushuo Wang 1, 2 1 State Key Laboratory.

Presentation transcript:

Hierarchical emotion classification and emotion component analysis on Chinese micro-blog posts. Hua Xu (1), Weiwei Yang (1), Jiushuo Wang (1, 2). (1) State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University; (2) School of Information Science and Engineering, Hebei University of Science and Technology. Expert Systems with Applications, 2015. Presenter: 劉憶年, 2015/8/18

Outline: Introduction; Related work; Emotion classification; Emotion component analysis; Experiment results and analysis; Application; Conclusion

Introduction (1/4) For years, researchers have been trying to classify the emotions in text automatically. Views and attitudes often contain emotions, and micro-blog posts directly reflect users' opinions. The limited length of posts makes emotion classification challenging and requires more effective methods to extract features. Internet slang is also hard to handle because it does not follow language rules. By definition, emotion is a subjective thought or feeling such as happy or angry, while sentiment addresses objective positive and negative attitudes. A post can therefore contain sentiment but no emotion. Example 1: The phone broke within two days.

Introduction (2/4) Currently, most researchers focus on sentiment analysis and on emotion classification over six basic coarse-grained emotion classes: happy, surprise, angry, disgusted, fear and sad. However, coarse-grained emotions cannot fully depict the emotions in text. Example 2: This car is not so easy to drive as the ad says. I am so disappointed. To describe emotions better, fine-grained emotions need to be added to the coarse-grained emotion categories, which forms a hierarchy. Adopting fine-grained emotions also greatly increases the number of classes, which is difficult for flat classifiers, so hierarchical classification is required.

Introduction (3/4) So far, most work uses English corpora, and few papers report results on Chinese. A psychological emotion dictionary, an Internet slang dictionary and an emoticon dictionary are employed to segment posts and form the feature space. Features are then selected by a combination of the χ²-test, word frequency and pointwise mutual information (PMI) in order to retain effective features. Finally, we employ support vector regression (SVR) and rule sets generated from PMI values to obtain the classification results, which, as reported later, are very encouraging.

Introduction (4/4) In this paper, a four-level fine-grained emotion hierarchy with 19 basic emotions is adopted. However, posts usually contain more than one kind of emotion. We therefore propose an emotion component analysis (ECA) algorithm that detects the principal emotions in a post and calculates their ratios from the classification results, more specifically from the distances between regression values and class thresholds.

Related work (1/3) Although probability-based algorithms are quite useful, machine learning approaches are now preferred by researchers. To classify text better, researchers spend time constructing and improving emotion lexicons, which bring significant improvement to emotion classification on text. In addition to classification algorithms and emotion lexicons, the choice of corpus also varies. Some researchers try to classify emotions in blog posts.

Related work (2/3) Flat classification assigns examples to classes directly, whereas hierarchical classification classifies examples from top to bottom according to a predetermined multi-level class system and obtains the final result at the bottom level. Flat classification is the most widely adopted, but on a large dataset it is difficult for each classifier to distinguish the examples that belong to its class from those of other classes. In recent years, as micro-blogs have become more widely used, micro-blog posts have become a new corpus source for emotion classification. Besides, experiments on other kinds of corpus are also reported, e.g. s, novels and Japanese dialog systems.

Related work (3/3) Our contributions are different. We hierarchically classify Chinese micro-blog posts into 19 fine-grained emotion classes with a machine learning approach and propose an ECA algorithm based on the regression values. In the segmentation step, a psychological emotion dictionary is adopted to improve the effectiveness of the algorithm, which has scientific value for both social network knowledge discovery and data mining.

Emotion classification -- Hierarchy This hierarchy contains 19 fine-grained emotion classes at the bottom level, or 20 leaf nodes if neutral, which denotes the non-emotional class, is counted.

Emotion classification -- Preprocessing Usernames: a mention of another user is non-emotional, so we locate it by its leading symbol (@ on Sina Weibo) and remove it together with the username. Topics: every user can take part in discussions under a certain topic. To participate, users only need to include the topic in the post, delimited by two # symbols, e.g. #Emotion Analysis#. Links: users can include links in their posts. The micro-blog platform converts them into short links to save space. Position information: micro-blog platforms allow users to add position information at the end of posts, which does not help emotion classification.
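
The preprocessing elements listed on this slide can be illustrated with a few regular expressions. This is only a minimal sketch, not the authors' code: the patterns and the function name `clean_post` are assumptions, it assumes that topics and links are stripped along with usernames, and position information is left out because the slide does not describe its format.

```python
import re

# Hypothetical patterns; the slides do not give the exact rules used.
MENTION = re.compile(r"@[\w\-]+")    # "@" followed by a username
TOPIC = re.compile(r"#[^#]+#")       # topic wrapped in two # symbols
LINK = re.compile(r"https?://\S+")   # (short) links

def clean_post(text):
    """Remove links, topics and user mentions, then collapse whitespace."""
    text = LINK.sub(" ", text)
    text = TOPIC.sub(" ", text)
    text = MENTION.sub(" ", text)
    return " ".join(text.split())

print(clean_post("@user1 #Emotion Analysis# I am happy http://t.cn/abc"))
```

In real posts a mention may be followed directly by Chinese text without a space, so the username pattern would likely need a platform-specific character limit.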

Emotion classification -- Feature extraction Emoticon features can express some more complex emotions, so extracting them is important. To mine POS features, we employ the ICTCLAS package to segment posts and then extract adjectives, nouns, verbs, etc. to form the feature space. Two semantic rules are also applied. The first extracts repeated exclamation marks (!) and question marks (?). The second joins a negative word with an adjacent adjective, because such a phrase has the opposite meaning of the original adjective. Because adverbs may occur between the two, we set a distance threshold of 3 according to Chinese language habits.
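
The two semantic rules can be sketched as follows, under stated assumptions: tokens arrive as (word, POS tag) pairs from a segmenter, the tag "a" marks adjectives, the negation list is a hypothetical placeholder, and "distance threshold 3" is read as "the adjective must appear within three tokens after the negative word". None of these details are given on the slide.

```python
import re

NEGATIONS = {"不", "没", "not"}   # hypothetical negation list
MAX_DISTANCE = 3                  # distance threshold from the slide

def repeated_marks(text):
    """Rule 1: runs of two or more exclamation or question marks."""
    return re.findall(r"[!！]{2,}|[?？]{2,}", text)

def merge_negations(tokens):
    """Rule 2: join a negation with the first adjective within MAX_DISTANCE tokens."""
    features, consumed = [], set()
    for i, (word, pos) in enumerate(tokens):
        if i in consumed:
            continue
        if word in NEGATIONS:
            for j in range(i + 1, min(i + 1 + MAX_DISTANCE, len(tokens))):
                if tokens[j][1] == "a":          # assumed adjective tag
                    features.append(word + "_" + tokens[j][0])
                    consumed.add(j)
                    break
            else:
                features.append(word)            # no adjective close enough
            continue
        features.append(word)
    return features

tokens = [("not", "d"), ("so", "d"), ("easy", "a"), ("drive", "v")]
print(merge_negations(tokens))
print(repeated_marks("What??? No!!"))
```

Intervening adverbs are kept as separate features here; whether the original method keeps or drops them is not stated.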

Emotion classification -- Feature selection (1/3) More than 20,000 words are extracted in the previous step, so it is necessary to select effective features from the original feature space. Here the χ²-test, which is implemented with Weka, is adopted together with word frequency and PMI. [Equation (1)]

Emotion classification -- Feature selection (2/3)

Emotion classification -- Feature selection (3/3) The χ²-test can pick out the words that are highly correlated with classes. However, it can be affected by word frequency, so the word frequency ratio is adopted as auxiliary information. The selection of low-frequency words depends on PMI, as it is less sensitive to word frequencies. All words whose PMI values exceed a positive threshold are picked out to form the low-frequency word set.
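
For readers unfamiliar with the two statistics, the sketch below computes the textbook χ² statistic and PMI for one word and one class from a 2x2 document-count table. It is not Weka's implementation and not necessarily the exact formula used in the paper; the argument names a, b, c, d are assumptions introduced for illustration.

```python
import math

# a: posts in the class that contain the word
# b: posts outside the class that contain the word
# c: posts in the class without the word
# d: posts outside the class without the word

def chi_square(a, b, c, d):
    """Standard chi-square statistic for a 2x2 contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def pmi(a, b, c, d):
    """Pointwise mutual information log2(P(w, c) / (P(w) * P(c)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")
    return math.log2(a * n / ((a + b) * (a + c)))

print(chi_square(30, 10, 20, 40))
print(pmi(30, 10, 20, 40))
```

A positive PMI means the word occurs in the class more often than independence would predict, which is why a positive threshold is a natural cut-off for the low-frequency word set.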

Emotion classification -- Classification (1/2) SVR allows us to select the classification threshold dynamically, rather than using a fixed one as in SVM.

Emotion classification -- Classification (2/2) The class with the maximum distance between its regression value and its threshold is selected as the final result, as it is the most confident one.
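
The selection rule can be written in a few lines. This sketch assumes one regression value and one threshold per class and measures the distance as the signed difference (value minus threshold); the function name `pick_class` and the example numbers are hypothetical.

```python
def pick_class(scores, thresholds):
    """Return the class whose regression value exceeds its threshold by the most."""
    margins = {c: scores[c] - thresholds[c] for c in scores}
    best = max(margins, key=margins.get)
    return best, margins

# Hypothetical regression values and per-class thresholds.
scores = {"happy": 0.7, "sad": 0.4}
thresholds = {"happy": 0.5, "sad": 0.1}
best, margins = pick_class(scores, thresholds)
print(best)
```

Here "sad" wins even though "happy" has the larger raw regression value, because its margin over its own threshold is larger.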

Emotion component analysis (1/2) A micro-blog post usually contains more than one kind of emotion, so classification results alone cannot accurately reflect its emotion components. Based on the confidence concept in multi-class classification, we propose an ECA algorithm to detect the principal emotions in a post and calculate their ratios. Example 3: This flower is picked at the side of road and brings me good mood. If you can find such little nice things in daily life, you will be a happy guy.
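
The transcript does not reproduce the ECA formula, so the following is only one plausible reading, not the authors' algorithm: keep the classes whose regression value exceeds the threshold, take at most four of them (the conclusion mentions four principal emotions), and normalize their margins into ratios. The function name and the example margins are assumptions.

```python
def emotion_components(margins, top_k=4):
    """Turn positive margins (value minus threshold) into emotion ratios.

    Assumed scheme: simple normalization of the top_k positive margins.
    """
    positive = {c: m for c, m in margins.items() if m > 0}
    top = sorted(positive.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(m for _, m in top)
    return {c: m / total for c, m in top} if total > 0 else {}

# Hypothetical margins for one post.
margins = {"happy": 0.6, "surprised": 0.3, "angry": 0.1, "sad": -0.2}
print(emotion_components(margins))
```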

Emotion component analysis (2/2)

Experiment results and analysis -- Dataset (1/2) As there is no benchmark dataset for fine-grained emotion classification, we randomly crawled 9,960 original Chinese micro-blog posts from Sina Weibo as the dataset, to preserve the authenticity and practicality of the posts. Two annotators completed the annotation independently and disagreed on about 35% of the posts. This is acceptable considering the lack of clear boundaries between emotions and the existence of emotion combinations. Disagreements are resolved by the first author, who chooses one of the competing labels as the final label.

Experiment results and analysis -- Dataset (2/2)

Experiment results and analysis -- Experimental group setting The psychological emotion dictionary contains more than 52,000 words, which we put into 6 groups. Each group describes one kind of emotion: happy, distressed, surprised, fearful, angry and disgusted.

Experiment results and analysis -- Level results (1/2)

Experiment results and analysis -- Level results (2/2) The results demonstrate the effectiveness of our feature selection method, which removes many noisy features and retains highly correlated ones. The psychological emotion dictionary does have a positive effect on classification, as it is the only difference between the compared groups. All classifiers perform well, so good performance of the whole model can be expected.

Experiment results and analysis -- Hierarchical results (1/2) In hierarchical classification, each test example is classified from the top level successively to the bottom level.
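
Top-down classification can be sketched as a walk through per-node classifiers. The node names and the keyword-based toy classifiers below are hypothetical stand-ins (the slide does not list the hierarchy), and in the described method each node would be decided by trained SVR models rather than keyword checks.

```python
def classify_top_down(post, classifiers, root="root"):
    """Follow per-node classifiers from the root until a leaf is reached."""
    node, path = root, []
    while node in classifiers:          # a node without a classifier is a leaf
        node = classifiers[node](post)
        path.append(node)
    return path

# Toy two-level hierarchy with hypothetical labels.
classifiers = {
    "root": lambda p: "emotional" if any(w in p for w in ("happy", "sad")) else "neutral",
    "emotional": lambda p: "positive" if "happy" in p else "negative",
}

print(classify_top_down("I am so happy today", classifiers))
print(classify_top_down("The phone broke within two days.", classifiers))
```

Each lower-level classifier only ever sees posts that passed the level above it, which is the filtering effect credited for the hierarchical results.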

Experiment results and analysis -- Hierarchical results (2/2) In flat classification, it is not easy for each classifier to distinguish the examples that belong to its class from those of other classes when given the whole dataset. Hierarchical classification, in contrast, removes most irrelevant examples with upper-level classifiers, which makes classification easier for lower-level ones.

Experiment results and analysis -- ECA results We use human judgement to evaluate the ECA results. If the analysis result of a post is supported by more than half of the judges, we consider it plausible.
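
The plausibility criterion is a strict majority vote, which can be stated precisely in code. The function name and the vote encoding (one boolean per judge) are assumptions.

```python
def is_plausible(votes):
    """True when strictly more than half of the judges support the result."""
    return sum(votes) > len(votes) / 2

print(is_plausible([True, True, False]))
print(is_plausible([True, False]))
```

Note that a tie does not count as support under this reading of "more than half".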

Application First, we will apply our algorithm to consumer behavior analysis. Second, we can apply it to analyzing the effect of commercial promotions. Third, we can track how the emotions of micro-blog users change over time, and thus track their happiness, the happiness index of certain areas and so on.

Conclusion (1/4) This paper focuses on emotion classification and emotion component analysis on Chinese micro-blog posts. We obtain good classification results on our dataset by applying several optimization methods, which the comparison between experimental groups shows to be effective. We also propose an ECA algorithm, which can detect the four principal emotions in a post and calculate their proportions.

Conclusion (2/4) First, in the application area of social management, the government can find existing problems by analyzing public emotions in social media. Second, in the segmentation step, a psychological emotion dictionary is adopted to improve the effectiveness of the algorithm, which has scientific value for both social network knowledge discovery and data mining. Third, many researchers now focus on positive/negative classification or on coarse-grained basic emotion classification with 6–7 classes, while this work adopts a four-level fine-grained emotion hierarchy with 19 basic emotions.

Conclusion (3/4) First, this paper employs the ICTCLAS package to segment Chinese posts, but because micro-blogs contain many colloquial expressions, the quality of Chinese word segmentation is not very good. Second, due to the complexity of the feature space in classification, the feature extraction and feature selection algorithms need further refinement. Third, our ECA algorithm is designed around a limited set of factors. Although it is reasonable to a degree, it could be improved in the future.

Conclusion (4/4) First, we will focus on building a new dictionary that contains more emotional words and micro-blog slang, so that feature extraction can be improved. Second, we will try to improve our ECA algorithm by adding more factors in order to get better analysis results, for example by redesigning the calculation formula, normalizing the classification values and so on. Third, sarcastic expressions in micro-blog posts involve deeper semantic analysis, scenario analysis and contextual analysis, and we leave them for further research. In terms of applications, first, our research could be applied to precision marketing for product recommendation. Second, it can also be used to develop an opinion analysis system.