MACHINE LEARNING CLASSIFICATION OF USER INTERESTS ACROSS LANGUAGES AND SOCIAL NETWORKS
Elena Mikhalkova, Nadezhda Ganzherli, Yuri Karyakin, Dmitriy Grigoryev
Tyumen State University
Assumption: ML classification of user interests does not depend on the language or the network (...and, probably, not on the interest itself).
Dataset: https://github.com/evrog/TSAAP
Krippendorff’s α = 0.82 (> 0.8)

No. of pages:
                 Vkontakte Russian   Twitter Russian   Twitter English
Football                 39                 33                97
Rock Music              109                 37                96
Vegetarianism           127                 32               100
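As a side note, a minimal sketch of how an agreement value like this α can be computed with NLTK's AnnotationTask; the coder names, page ids, and labels below are hypothetical placeholders, not items from the TSAAP data.

    from nltk.metrics.agreement import AnnotationTask

    # Hypothetical (coder, item, label) triples; the study reports
    # α = 0.82 for the dataset annotation.
    triples = [
        ("coder1", "page1", "Football"),      ("coder2", "page1", "Football"),
        ("coder1", "page2", "Rock Music"),    ("coder2", "page2", "Vegetarianism"),
        ("coder1", "page3", "Vegetarianism"), ("coder2", "page3", "Vegetarianism"),
    ]

    task = AnnotationTask(data=triples)
    print("Krippendorff's alpha:", task.alpha())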
Normalization & Lemmatization
Our own tweet preprocessing software: https://github.com/evrog/PunFields/blob/master/preprocessing_tweet.py
English texts => NLTK Lemmatizer; Russian texts => Pymystem3.
We do not exclude stop-words!
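A rough illustration of the lemmatization step (not the repository script itself), assuming NLTK's WordNetLemmatizer for English and pymystem3's Mystem for Russian; the example sentences are made up.

    from nltk.stem import WordNetLemmatizer      # requires the NLTK 'wordnet' data
    from nltk.tokenize import word_tokenize      # requires the NLTK 'punkt' data
    from pymystem3 import Mystem

    en_lemmatizer = WordNetLemmatizer()
    ru_lemmatizer = Mystem()

    def lemmatize_english(text):
        # Tokenize, lowercase, and lemmatize; stop-words are deliberately kept.
        return [en_lemmatizer.lemmatize(token.lower()) for token in word_tokenize(text)]

    def lemmatize_russian(text):
        # Mystem lemmatizes the whole string; drop whitespace-only items.
        return [lemma for lemma in ru_lemmatizer.lemmatize(text) if lemma.strip()]

    print(lemmatize_english("The fans were singing football songs"))
    print(lemmatize_russian("Болельщики пели футбольные песни"))  # "The fans sang football songs"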
Interclass classification
Cross-validation: 200 texts of different length (1,800 texts in total); average F1-score over 5 folds.
Algorithms: Support Vector Machine, Neural Network, Naive Bayes, Logistic Regression, Decision Trees, k-Nearest Neighbors.
Optimization parameters in Scikit-Learn: four SVM kernel functions (linear, polynomial, Radial Basis Function, sigmoid); Bernoulli, Multinomial, and Gaussian variants of Naive Bayes; Multi-Layer Perceptron (NN) with 1 hidden layer of 100 neurons and two solver functions (lbfgs and adam); three data models (see the next slide and the pipeline sketch after it)...
Data Models
Bernoulli - absence/presence of a word (0 or 1);
Frequency distribution - presence of a word denoted by its frequency in the training vocabulary (integer in [0; +∞));
Normalized frequency - presence of a word denoted by its normalized frequency in the training vocabulary, in the interval [0; 1].
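A minimal sketch of how this experimental grid can be assembled in scikit-learn. The specific vectorizers chosen for the three data models, the f1_macro scoring, and all names here are illustrative assumptions, not the authors' code.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # The three data models expressed as vectorizers (one plausible reading):
    data_models = {
        "bernoulli": CountVectorizer(binary=True),                # word present = 1, absent = 0
        "frequency": CountVectorizer(),                           # raw word counts
        "normalized": TfidfVectorizer(use_idf=False, norm="l1"),  # counts rescaled into [0; 1]
    }

    # Classifier variants from the previous slide
    # (GaussianNB is left out here because it needs dense input).
    classifiers = {
        "svm_linear": SVC(kernel="linear"),
        "svm_poly": SVC(kernel="poly"),
        "svm_rbf": SVC(kernel="rbf"),
        "svm_sigmoid": SVC(kernel="sigmoid"),
        "nb_bernoulli": BernoulliNB(),
        "nb_multinomial": MultinomialNB(),
        "log_reg": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(),
        "knn": KNeighborsClassifier(),
        "mlp_lbfgs": MLPClassifier(hidden_layer_sizes=(100,), solver="lbfgs", max_iter=1000),
        "mlp_adam": MLPClassifier(hidden_layer_sizes=(100,), solver="adam", max_iter=1000),
    }

    def run_grid(texts, labels):
        """Average F1 over 5 folds for every (data model, classifier) pair."""
        results = {}
        for dm_name, vectorizer in data_models.items():
            for clf_name, clf in classifiers.items():
                pipeline = make_pipeline(vectorizer, clf)
                scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_macro")
                results[(dm_name, clf_name)] = scores.mean()
        return results

In this sketch the Bernoulli / frequency / normalized distinction lives entirely in the vectorizer, so the same classifier grid is reused for all three data models; vectorizers are fitted inside the cross-validation pipeline so the training vocabulary is built per fold.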
Results-1
Lemmatization slightly increases performance (by about 3%): ∑ F1-scores of 262.752 versus 254.186.
Effectiveness of the Bernoulli model: (by mode) 18 versus 4 scores of 1.0; (by mean) 0.845, versus 0.753 for the plain frequency model and 0.795 for the normalized one.
Logistic Regression with the Bernoulli model: ∑ F1-scores = 17.71, versus 17.664 for the Neural Network (lbfgs) with the Bernoulli model (no need to add layers…) and 17.5 for Multinomial Bayes with the plain frequency model.
Sums and means of F1-scores (F = Football, R = Rock Music, V = Vegetarianism; Vk = Vkontakte, T = Twitter, Ru = Russian, En = English, xAVE = mean).

Normalized texts
MaI   Total    Vk Ru    T Ru     T En     Vk, xAVE   T, xAVE   Ru, xAVE   En, xAVE
F     33.976   10.24    11.826   11.91    0.853      0.989     0.919      0.993
R     33.138   10.064   11.334   11.74    0.839      0.961     0.892      0.978
V     32.906   9.808    11.302   10.796   0.817      0.962     0.88       0.983

Lemmatized texts
MaI   Total    Vk Ru    T Ru     T En     Vk, xAVE   T, xAVE   Ru, xAVE   En, xAVE
F     34.282   10.43    11.932   11.92    0.869      0.994     0.932
R     33.942   10.398   11.624   11.754   0.867      0.974     0.918      0.98
V     33.708   10.272   11.622   11.814   0.856      0.977     0.912      0.985
Results-2: Mann-Whitney U
DIFFERENCE BY NETWORK: Vkontakte-Russian scores lower than Twitter-English (p-value = 1.0, alternative "greater") and Twitter-Russian (p-value = 0.99, alternative "greater").
DIFFERENCE BY INTEREST: Vegetarianism and Rock Music are very likely to score less than Football (p-value = 0.99, alternative "greater", in both comparisons).
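A minimal sketch of such a comparison with SciPy; the score lists are hypothetical placeholders, not the study's actual per-fold F1 values.

    from scipy.stats import mannwhitneyu

    # Hypothetical F1 samples standing in for the scores of two groups
    # (e.g. Vkontakte-Russian versus Twitter-English).
    f1_vk_ru = [0.85, 0.84, 0.87, 0.83, 0.86]
    f1_twitter_en = [0.99, 0.98, 0.99, 0.97, 0.99]

    # One-sided test: is the first sample stochastically greater than the second?
    u_stat, p_value = mannwhitneyu(f1_vk_ru, f1_twitter_en, alternative="greater")
    print(f"U = {u_stat}, p-value = {p_value:.3f}")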
Thank you!