Elena Mikhalkova, Nadezhda Ganzherli, Yuri Karyakin, Dmitriy Grigoryev


MACHINE LEARNING CLASSIFICATION OF USER INTERESTS ACROSS LANGUAGES AND SOCIAL NETWORKS Elena Mikhalkova, Nadezhda Ganzherli, Yuri Karyakin, Dmitriy Grigoryev Tyumen State University

Assumption: ML classification of user interests does not depend on the language or the social network (...and, probably, not on the interest itself).

Dataset: https://github.com/evrog/TSAAP
Inter-annotator agreement: Krippendorff's α = 0.82 (above the 0.8 reliability threshold)

No. of pages per interest:
                       Football   Rock Music   Vegetarianism
Vkontakte (Russian)       39          109           127
Twitter (Russian)         33           37            32
Twitter (English)         97           96           100

Normalization & Lemmatization
Our own tweet-preprocessing software: https://github.com/evrog/PunFields/blob/master/preprocessing_tweet.py
English texts => NLTK lemmatizer; Russian texts => Pymystem3.
We do not exclude stop-words!
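A rough, stdlib-only sketch of the normalization step (the authors' actual script is the PunFields preprocessing_tweet.py linked above; the regexes here are illustrative assumptions, and lemmatization with NLTK/Pymystem3 is omitted):

```python
import re

URL = re.compile(r"https?://\S+")      # drop links
MENTION = re.compile(r"@\w+")          # drop @mentions
TOKEN = re.compile(r"[#\w']+", re.UNICODE)  # keep words and hashtags

def normalize_tweet(text: str) -> list:
    """Lowercase, strip URLs and @mentions; stop-words are retained."""
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    return TOKEN.findall(text.lower())

tokens = normalize_tweet(
    "Check this out https://t.co/abc @user I love the football match! #football"
)
```

After this step, English tokens would go to the NLTK lemmatizer and Russian tokens to Pymystem3; note that function words like "the" survive, matching the slide's "we do not exclude stop-words".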

Interclass classification
Cross-validation: 200 texts of different length per class (1,800 texts in sum); average F1-score over 5 folds.
Algorithms: Support Vector Machine, Neural Network, Naive Bayes, Logistic Regression, Decision Trees, k-Nearest Neighbors.
Hyperparameters varied in Scikit-Learn: four SVM kernel functions (linear, polynomial, Radial Basis Function, sigmoid); Bernoulli, Multinomial, and Gaussian variants of Naive Bayes; a Multi-layer Perceptron (NN) with 1 hidden layer of 100 neurons and two solvers (lbfgs and adam); three data models...
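A minimal sketch of this comparison setup in Scikit-Learn. The estimator choices mirror the slide, but the data is synthetic, standing in for the 1,800 text vectors; scores on this toy data say nothing about the paper's results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the text feature vectors (3 interest classes).
X, y = make_classification(n_samples=200, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

classifiers = {
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Neural Network (lbfgs)": MLPClassifier(hidden_layer_sizes=(100,),
                                            solver="lbfgs", max_iter=500),
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Logistic Regression": LogisticRegression(max_iter=500),
    "Decision Trees": DecisionTreeClassifier(random_state=0),
    "k-Nearest Neighbors": KNeighborsClassifier(),
}

# Average F1-score over 5 folds for each algorithm.
scores = {name: cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean()
          for name, clf in classifiers.items()}
```

The same loop, repeated over kernels, Naive Bayes variants, solvers, and data models, yields the grid of F1-scores the later slides summarize.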

Data Models
- Bernoulli: absence/presence of a word (0 or 1);
- Frequency distribution: presence of a word denoted by its frequency in the training vocabulary (an integer in [0; +∞));
- Normalized frequency: presence of a word denoted by its normalized frequency in the training vocabulary, in the interval [0; 1].
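The three data models can be sketched with Scikit-Learn vectorizers. The toy corpus and the choice of per-document L1 normalization for the third model are assumptions, not the authors' code (the slide does not specify the normalizer):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

docs = ["football match football", "rock music concert"]  # toy corpus

# Bernoulli: 0/1 absence/presence of each vocabulary word
bernoulli = CountVectorizer(binary=True).fit_transform(docs)

# Frequency distribution: raw counts, integers in [0, +inf)
counts = CountVectorizer().fit_transform(docs)

# Normalized frequency: counts rescaled into [0, 1]
# (L1 per document here; one plausible reading of the slide)
normed = normalize(counts, norm="l1")
```

On this corpus, "football" appears twice in the first document: 1 in the Bernoulli model, 2 in the frequency model, and 2/3 after normalization.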

Results-1
Lemmatization slightly increases performance (by about 3%): sum of F1-scores 262.752 versus 254.186.
Effectiveness of the Bernoulli model: by mode, 18 versus 4 perfect scores of 1.0; by mean, 0.845 versus 0.753 (plain frequency) and 0.795 (normalized frequency).
Best combination: Logistic Regression with the Bernoulli model, sum of F1-scores = 17.71; versus 17.664 for the Neural Network (lbfgs) with the Bernoulli model (no need to add layers...) and 17.5 for Multinomial Bayes with plain frequencies.

F1-scores by interest (F = Football, R = Rock Music, V = Vegetarianism):

Normalized texts
MaI   Total    Vk Ru    T Ru     T En     Vk, xAVE   T, xAVE   Ru, xAVE   En, xAVE
F     33.976   10.24    11.826   11.91    0.853      0.989     0.919      0.993
R     33.138   10.064   11.334   11.74    0.839      0.961     0.892      0.978
V     32.906   9.808    11.302   10.796   0.817      0.962     0.88       0.983

Lemmatized texts
F     34.282   10.43    11.932   11.92    0.869      0.994     0.932
R     33.942   10.398   11.624   11.754   0.867      0.974     0.918      0.98
V     33.708   10.272   11.622   11.814   0.856      0.977     0.912      0.985

Results-2: Mann-Whitney U
DIFFERENCE BY NETWORK: Vkontakte-Russian scores lower than Twitter-English (p-value = 1.0, one-sided "greater") and Twitter-Russian (p-value = 0.99, one-sided "greater").
DIFFERENCE BY INTEREST: Vegetarianism and Rock Music are very likely to score less than Football (p-value = 0.99, one-sided "greater", for each).
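The one-sided Mann-Whitney U test can be run with SciPy. The per-fold F1-scores below are made-up illustrative numbers, not the paper's data:

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-fold F1-scores for two groups of runs
twitter_en = [0.99, 0.98, 1.00, 0.97, 0.99]
vk_ru      = [0.85, 0.82, 0.88, 0.80, 0.84]

# One-sided test: are Twitter-English scores greater than Vkontakte-Russian?
stat, p = mannwhitneyu(twitter_en, vk_ru, alternative="greater")
```

With these fully separated samples every pairwise comparison favors the first group, so the U statistic hits its maximum and the p-value is small; the slide's comparisons are the same test applied to the real score distributions.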

Thank you!