MACHINE LEARNING CLASSIFICATION OF USER INTERESTS ACROSS LANGUAGES AND SOCIAL NETWORKS
Elena Mikhalkova, Nadezhda Ganzherli, Yuri Karyakin, Dmitriy Grigoryev
Tyumen State University
Assumption: ML classification of user interests does not depend on the language or the network (...and, probably, not on the interest itself).
Dataset: https://github.com/evrog/TSAAP
Krippendorff’s α = 0.82 (> 0.8)

No. of pages:
                 Vkontakte Russian   Twitter Russian   Twitter English
Football                 39                 33                97
Rock Music              109                 37                96
Vegetarianism           127                 32               100
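As a side note, a minimal sketch of how an agreement value like this α can be computed with NLTK's AnnotationTask; the coder names, page ids, and labels below are hypothetical placeholders, not items from the TSAAP data.

    from nltk.metrics.agreement import AnnotationTask

    # Hypothetical (coder, item, label) triples; the study reports
    # α = 0.82 for the dataset annotation.
    triples = [
        ("coder1", "page1", "Football"),      ("coder2", "page1", "Football"),
        ("coder1", "page2", "Rock Music"),    ("coder2", "page2", "Vegetarianism"),
        ("coder1", "page3", "Vegetarianism"), ("coder2", "page3", "Vegetarianism"),
    ]

    task = AnnotationTask(data=triples)
    print("Krippendorff's alpha:", task.alpha())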
Normalization & Lemmatization
Our own tweet preprocessing software: https://github.com/evrog/PunFields/blob/master/preprocessing_tweet.py
English texts => NLTK Lemmatizer; Russian texts => Pymystem3.
We do not exclude stop-words!
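A rough illustration of the lemmatization step (not the repository script itself), assuming NLTK's WordNetLemmatizer for English and pymystem3's Mystem for Russian; the example sentences are made up.

    from nltk.stem import WordNetLemmatizer      # requires the NLTK 'wordnet' data
    from nltk.tokenize import word_tokenize      # requires the NLTK 'punkt' data
    from pymystem3 import Mystem

    en_lemmatizer = WordNetLemmatizer()
    ru_lemmatizer = Mystem()

    def lemmatize_english(text):
        # Tokenize, lowercase, and lemmatize; stop-words are deliberately kept.
        return [en_lemmatizer.lemmatize(token.lower()) for token in word_tokenize(text)]

    def lemmatize_russian(text):
        # Mystem lemmatizes the whole string; drop whitespace-only items.
        return [lemma for lemma in ru_lemmatizer.lemmatize(text) if lemma.strip()]

    print(lemmatize_english("The fans were singing football songs"))
    print(lemmatize_russian("Болельщики пели футбольные песни"))  # "The fans sang football songs"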
Interclass classification
Cross-validation: 200 texts of different length (1,800 texts in total); average F1-score over 5 folds.
Algorithms: Support Vector Machine, Neural Network, Naive Bayes, Logistic Regression, Decision Trees, k-Nearest Neighbors.
Optimization parameters in Scikit-Learn: four SVM kernel functions (linear, polynomial, Radial Basis Function, sigmoid); Bernoulli, Multinomial, and Gaussian variants of Naive Bayes; Multi-Layer Perceptron (NN) with 1 hidden layer of 100 neurons and two solver functions (lbfgs and adam); three data models (see the next slide and the pipeline sketch after it)...
Data Models
Bernoulli - absence/presence of a word (0 or 1);
Frequency distribution - presence of a word denoted by its frequency in the training vocabulary (integer in [0; +∞));
Normalized frequency - presence of a word denoted by its normalized frequency in the training vocabulary, in the interval [0; 1].
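A minimal sketch of how this experimental grid can be assembled in scikit-learn. The specific vectorizers chosen for the three data models, the f1_macro scoring, and all names here are illustrative assumptions, not the authors' code.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # The three data models expressed as vectorizers (one plausible reading):
    data_models = {
        "bernoulli": CountVectorizer(binary=True),                # word present = 1, absent = 0
        "frequency": CountVectorizer(),                           # raw word counts
        "normalized": TfidfVectorizer(use_idf=False, norm="l1"),  # counts rescaled into [0; 1]
    }

    # Classifier variants from the previous slide
    # (GaussianNB is left out here because it needs dense input).
    classifiers = {
        "svm_linear": SVC(kernel="linear"),
        "svm_poly": SVC(kernel="poly"),
        "svm_rbf": SVC(kernel="rbf"),
        "svm_sigmoid": SVC(kernel="sigmoid"),
        "nb_bernoulli": BernoulliNB(),
        "nb_multinomial": MultinomialNB(),
        "log_reg": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(),
        "knn": KNeighborsClassifier(),
        "mlp_lbfgs": MLPClassifier(hidden_layer_sizes=(100,), solver="lbfgs", max_iter=1000),
        "mlp_adam": MLPClassifier(hidden_layer_sizes=(100,), solver="adam", max_iter=1000),
    }

    def run_grid(texts, labels):
        """Average F1 over 5 folds for every (data model, classifier) pair."""
        results = {}
        for dm_name, vectorizer in data_models.items():
            for clf_name, clf in classifiers.items():
                pipeline = make_pipeline(vectorizer, clf)
                scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_macro")
                results[(dm_name, clf_name)] = scores.mean()
        return results

In this sketch the Bernoulli / frequency / normalized distinction lives entirely in the vectorizer, so the same classifier grid is reused for all three data models; vectorizers are fitted inside the cross-validation pipeline so the training vocabulary is built per fold.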
Results-1
Lemmatization slightly increases performance (by about 3%): ∑ F1-scores of 262.752 versus 254.186.
Effectiveness of the Bernoulli model: (by mode) 18 versus 4 scores of 1.0; (by mean) 0.845, versus 0.753 for the plain frequency model and 0.795 for the normalized one.
Logistic Regression with the Bernoulli model: ∑ F1-scores = 17.71, versus 17.664 for the Neural Network (lbfgs) with the Bernoulli model (no need to add layers…) and 17.5 for Multinomial Bayes with the plain frequency model.
Sums and means of F1-scores (F = Football, R = Rock Music, V = Vegetarianism; Vk = Vkontakte, T = Twitter, Ru = Russian, En = English, xAVE = mean).

Normalized texts
MaI   Total    Vk Ru    T Ru     T En     Vk, xAVE   T, xAVE   Ru, xAVE   En, xAVE
F     33.976   10.24    11.826   11.91    0.853      0.989     0.919      0.993
R     33.138   10.064   11.334   11.74    0.839      0.961     0.892      0.978
V     32.906   9.808    11.302   10.796   0.817      0.962     0.88       0.983

Lemmatized texts
MaI   Total    Vk Ru    T Ru     T En     Vk, xAVE   T, xAVE   Ru, xAVE   En, xAVE
F     34.282   10.43    11.932   11.92    0.869      0.994     0.932
R     33.942   10.398   11.624   11.754   0.867      0.974     0.918      0.98
V     33.708   10.272   11.622   11.814   0.856      0.977     0.912      0.985
Results-2: Mann-Whitney U
DIFFERENCE BY NETWORK: Vkontakte-Russian scores lower than Twitter-English (p-value = 1.0, alternative "greater") and Twitter-Russian (p-value = 0.99, alternative "greater").
DIFFERENCE BY INTEREST: Vegetarianism and Rock Music are very likely to score less than Football (p-value = 0.99, alternative "greater", in both comparisons).
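A minimal sketch of such a comparison with SciPy; the score lists are hypothetical placeholders, not the study's actual per-fold F1 values.

    from scipy.stats import mannwhitneyu

    # Hypothetical F1 samples standing in for the scores of two groups
    # (e.g. Vkontakte-Russian versus Twitter-English).
    f1_vk_ru = [0.85, 0.84, 0.87, 0.83, 0.86]
    f1_twitter_en = [0.99, 0.98, 0.99, 0.97, 0.99]

    # One-sided test: is the first sample stochastically greater than the second?
    u_stat, p_value = mannwhitneyu(f1_vk_ru, f1_twitter_en, alternative="greater")
    print(f"U = {u_stat}, p-value = {p_value:.3f}")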
Thank you!