Opinion Mining and Topic Categorization with Novel Term Weighting
Roman Sergienko, Ph.D. student
Tatiana Gasanova, Ph.D. student
Ulm University, Germany
Shaknaz Akhmedova, Ph.D. student
Siberian State Aerospace University, Krasnoyarsk, Russia
Contents
Motivation
Databases
Text preprocessing methods
The novel term weighting method
Feature selection
Classification algorithms
Results of numerical experiments
Conclusions
Motivation
The goal of this work is to evaluate how competitive the novel term weighting method is against standard techniques for opinion mining and topic categorization. The criteria are:
1) Macro F-measure on the test set
2) Computational time
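To make the first criterion concrete, here is a minimal sketch of a macro F-measure. It assumes one common definition, the unweighted mean of per-class F1 scores; the evaluation campaigns may define the measure differently (for example, as the F-score of macro-averaged precision and recall). The function name `macro_f1` is hypothetical.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (an assumed definition)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Because every class contributes equally to the mean, a small class counts as much as a large one, which matters when class sizes are unbalanced.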
Databases: DEFT’07 and DEFT’08
DEFT’07 (opinion mining)
Corpus | Size | Classes
Books | Train size = 2074; Test size = 1386; Vocabulary = … | 0: negative, 1: neutral, 2: positive
Games | Train size = 2537; Test size = 1694; Vocabulary = … | 0: negative, 1: neutral, 2: positive
Debates | Train size = …; Test size = …; Vocabulary = … | 0: against, 1: for
DEFT’08 (topic categorization)
Corpus | Size | Classes
T1 | Train size = …; Test size = …; Vocabulary = … | 0: Sport, 1: Economy, 2: Art, 3: Television
T2 | Train size = …; Test size = …; Vocabulary = … | 0: France, 1: International, 2: Literature, 3: Science, 4: Society
The existing text preprocessing methods
Binary preprocessing
TF-IDF (Salton and Buckley, 1988)
Confident Weights, or ConfWeight (Soucy and Mineau, 2005)
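As an illustration of the first two methods, the sketch below computes binary weights and one simple TF-IDF variant (raw term frequency times log(N / document frequency)). Salton and Buckley describe several variants, so this particular formula is an assumption, and the function names are hypothetical. ConfWeight is omitted because its definition is not given on the slide.

```python
import math
from collections import Counter

def binary_weights(doc_tokens, vocabulary):
    """1 if the term occurs in the document, else 0."""
    present = set(doc_tokens)
    return [1 if term in present else 0 for term in vocabulary]

def tfidf_weights(docs, vocabulary):
    """tf * log(N / df) for every document; one assumed TF-IDF variant."""
    n_docs = len(docs)
    df = Counter()  # number of documents containing each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        vectors.append([
            tf[term] * math.log(n_docs / df[term]) if df[term] else 0.0
            for term in vocabulary
        ])
    return vectors

docs = [["good", "book"], ["bad", "book"]]
vocabulary = ["bad", "book", "good"]
vectors = tfidf_weights(docs, vocabulary)
```

Neither scheme uses class labels: a term that occurs in every document gets a TF-IDF weight of zero regardless of how it is distributed across classes.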
The novel term weighting method
$L$: the number of classes;
$n_i$: the number of instances of the $i$-th class;
$N_{ji}$: the number of occurrences of the $j$-th word in all instances of the $i$-th class;
$T_{ji} = N_{ji} / n_i$: the relative frequency of the $j$-th word in the $i$-th class;
$R_j = \max_i T_{ji}$: the largest relative frequency of the $j$-th word over all classes;
$S_j = \arg\max_i T_{ji}$: the index of the class assigned to the $j$-th word.
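The quantities defined on this slide can be computed as in the sketch below. The slide text does not include the final weight formula, so the expression used for `c` is an assumption taken from how the method is usually stated: $C_j = \frac{1}{\sum_i T_{ji}} \left( R_j - \frac{1}{L-1} \sum_{i \ne S_j} T_{ji} \right)$. The function name and the token-list input format are hypothetical.

```python
from collections import Counter

def novel_term_weights(docs, labels, num_classes):
    """Return {word: (C_j, S_j)} from tokenized docs and integer class labels.

    Requires num_classes >= 2. The formula for C_j is an assumption; the
    slide only defines T_ji, R_j and S_j.
    """
    n = Counter(labels)                                # n_i: instances per class
    counts = [Counter() for _ in range(num_classes)]   # N_ji: word counts per class
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)
    vocabulary = set().union(*counts)
    weights = {}
    for word in vocabulary:
        # T_ji: relative frequency of the word in each class
        t = [counts[i][word] / n[i] if n[i] else 0.0 for i in range(num_classes)]
        r = max(t)           # R_j
        s = t.index(r)       # S_j (first maximum on ties)
        rest = sum(t) - r    # sum of T_ji over classes other than S_j
        c = (r - rest / (num_classes - 1)) / sum(t)  # assumed C_j
        weights[word] = (c, s)
    return weights

docs = [["good", "good", "book"], ["bad", "book"]]
labels = [1, 0]
weights = novel_term_weights(docs, labels, num_classes=2)
```

Under this assumed formula, a word that occurs in only one class gets weight 1, and a word spread evenly over all classes gets weight 0.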
Feature selection
1) Calculate the relative frequency of each word in each class.
2) Assign each word to the class in which its relative frequency is highest.
3) For each utterance to be classified, sum the weights of the words assigned to each class.
4) The number of attributes equals the number of classes.
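Steps 3 and 4 can be sketched as follows. The example assumes that a table of per-word weights and assigned classes already exists (the `weights` dictionary below holds made-up illustrative values), that every occurrence of a word contributes its weight, and that unseen words are ignored. None of these details is stated on the slide.

```python
def class_features(tokens, weights, num_classes):
    """Map a tokenized utterance to one attribute per class.

    weights: {word: (weight, assigned_class)}; a hypothetical structure.
    """
    features = [0.0] * num_classes
    for token in tokens:
        if token in weights:          # words unseen in training are skipped
            weight, assigned_class = weights[token]
            features[assigned_class] += weight
    return features

# Illustrative values only, not taken from the experiments.
weights = {"good": (1.0, 1), "bad": (1.0, 0), "book": (0.0, 0)}
features = class_features(["good", "good", "book", "unknown"], weights, 2)
```

Whatever the vocabulary size, each utterance is reduced to a vector whose length is the number of classes, so the classifier works in a very low-dimensional space.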
Classification algorithms
Computational effectiveness
[Figure with two panels: DEFT’07 and DEFT’08]
The best values of F-measure
Problem | F-measure | The best known value | Term weighting method | Classification algorithm
Books | … | … | The novel TW | SVM
Games | … | … | ConfWeight | k-NN
Debates | … | … | ConfWeight | SVM
T1 | … | … | The novel TW | SVM
T2 | … | … | The novel TW | SVM
Comparison of ConfWeight and the novel term weighting
Problem | ConfWeight | The novel TW | Difference
Books | … | … | …
Games | … | … | …
Debates | … | … | …
T1 | … | … | …
T2 | … | … | …
Conclusions
The novel term weighting method achieves classification quality similar to or better than that of the ConfWeight method, while requiring the same amount of computational time as TF-IDF.