iSRD: Spam Review Detection with Imbalanced Data Distributions


iSRD: Spam Review Detection with Imbalanced Data Distributions Yan Zhu

Agenda Overview; Objective; Sentiment Analysis and Imbalanced Data Distributions; iSRD: Methodology; Experiments and Results; Conclusion

Overview As the internet permeates all aspects of our lives, we increasingly rely on it for everyday activities. Online product reviews have become vital for customers seeking additional user-centered knowledge about products. However, some vendors pay customers to write favorable reviews in order to boost their revenue from online sales.

Overview Examples of spam reviews include untruthful or fake reviews and reviews irrelevant to the product (such as advertisements). Machine learning is one of the most effective ways to distinguish spam from non-spam reviews. However, non-spam reviews usually form the majority of the population, while spam or fake reviews are relatively rare and difficult to obtain.

Objective The paper discusses sentiment analysis techniques for opinion mining that convey users' sentiments, covering document-level sentiment classification based on supervised learning and feature-based sentiment analysis. To address the problem of imbalanced datasets, the authors develop iSRD, a new classifier framework that handles imbalanced review data.

Sentiment Analysis And Imbalanced Data Distributions Sentiments often carry the real message that people want to deliver. Successfully analyzing and understanding sentiment is useful for many domain-specific applications. Sentiment analysis is categorized into three levels: document level, sentence level, and aspect level. Naïve Bayes and support vector machines have proven able to give good results in supervised learning using bag-of-words features.

Sentiment Analysis And Imbalanced Data Distributions Review spam has been categorized into three types: fake reviews, including untruthful reviews; reviews about the brand only, which describe and comment on the brand rather than the product or service; and non-reviews, which are not reviews at all, such as irrelevant text, questions, or advertisements.

Sentiment Analysis And Imbalanced Data Distributions The main challenge is that fake reviews are very hard to detect, even manually, because there is no clear way to distinguish fake from genuine reviews. Machine learning is well suited to this task: it achieves good generalization from the provided representations and learns behavior from the given examples in order to classify unseen ones.

Sentiment Analysis And Imbalanced Data Distributions In machine learning, imbalanced data distributions often arise from a lack of examples of the minority class. The problem appears when users try to train a good classifier from imbalanced training data: classifiers are inherently biased toward the majority class, leading to incorrect generalization rules.

Sentiment Analysis And Imbalanced Data Distributions Instead of accuracy alone, one should focus on precision, recall, sensitivity, and specificity, which give a more accurate picture of performance on the minority class.
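The measures named above can be computed directly from confusion-matrix counts. A minimal sketch (the example counts are illustrative, not from the paper's experiments); it also shows why accuracy alone misleads on imbalanced data:

```python
def imbalance_metrics(tp, fp, tn, fn):
    """Minority-class measures from confusion-matrix counts."""
    precision = tp / (tp + fp)            # of predicted spam, how many truly spam
    recall = tp / (tp + fn)               # sensitivity / true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, specificity, accuracy

# Hypothetical counts on an imbalanced test set (spam is the positive class):
p, r, spec, acc = imbalance_metrics(tp=10, fp=5, tn=90, fn=20)
# Accuracy is 0.8, yet recall on the spam class is only 1/3 —
# the majority class dominates the accuracy figure.
```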

Sentiment Analysis And Imbalanced Data Distributions Many methods exist for handling data with imbalanced distributions, including sampling and re-weighting. With these approaches, boosting and bagging are often used to combine classifiers trained from sampled datasets for prediction.

iSRD: Methodology The main idea is to use under-sampling to generate a relatively balanced dataset, and then use classifiers trained from the sampled datasets for prediction. Repeat the sampling multiple times, each time generating a balanced dataset, to reduce sample-selection bias. For each balanced dataset, train a classifier, and use the ensemble of classifiers from all sampled datasets for spam classification.

The proposed iSRD framework for spam review detection with imbalanced data distributions. #S and #N denote the numbers of spam and non-spam reviews in the dataset. The benchmark dataset is first split into a FIT (training) set and a test set. For the training set, we use β to change the data imbalance level so we can observe algorithm performance under different imbalance conditions. After that, we use random under-sampling to generate m copies of balanced datasets, where m = 10. We train a classifier from each balanced dataset, and use majority voting to classify reviews in the test set.

iSRD: Methodology First split the dataset into a training (FIT) set and a test set, where the two sets contain similar data imbalance ratios. Use β to change the data imbalance level in the FIT set to evaluate performance under different imbalance levels. After obtaining the altered FIT dataset, apply random under-sampling to generate balanced datasets.

iSRD: Methodology Train a classifier from each balanced dataset, then use the majority vote of the m classifiers to predict the class labels of the reviews in the test set. To validate this design, the experiments record the performance of each classifier against the same test set and compare the results.
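The under-sample/train/vote loop described above can be sketched as follows. This is a toy stand-in, not the paper's implementation: the nearest-centroid base learner and the 1-D synthetic data are illustrative assumptions replacing the paper's Weka-trained classifiers.

```python
import random
from collections import Counter

def undersample(majority, minority, rng):
    """Draw |minority| majority examples at random to form one balanced set."""
    return rng.sample(majority, len(minority)) + minority

def train_centroid(balanced):
    """Toy base learner: per-class mean of a 1-D feature."""
    sums, counts = {}, {}
    for x, y in balanced:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(models, x):
    """Majority vote of the m classifiers (nearest centroid per model)."""
    votes = [min(m, key=lambda y: abs(x - m[y])) for m in models]
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(0)
nonspam = [(rng.gauss(0, 1), "nonspam") for _ in range(200)]  # majority class
spam = [(rng.gauss(3, 1), "spam") for _ in range(20)]         # minority class
models = [train_centroid(undersample(nonspam, spam, rng)) for _ in range(10)]
label = predict(models, 3.2)  # a point near the spam centroid
```

Each of the m = 10 models sees all of the minority class but a different random slice of the majority class, so the vote averages out the sample-selection bias of any single under-sample.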

Experiments Data Collection Collected review reports for multiple hotels located in different cities and countries. Two major data sources: the Opinion Based Entity Ranking Project Dataset (2012), and deceptive or fake reviews from the Deceptive Opinion Spam Corpus v1.4, gathered via Amazon Mechanical Turk.

Data Collection Data Preprocessing Form a dataset with two columns, where each row denotes a review: the first column contains the full review text and the second column the class label. Convert the texts into a bag-of-words representation using the StringToWordVector filter in Weka. Store the dataset as an ARFF file for use in the following steps.
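The bag-of-words conversion can be illustrated with a minimal stdlib sketch. This is a stand-in for Weka's StringToWordVector (which offers tokenizers, stoplists, and TF-IDF weighting that are omitted here); the sample reviews are made up.

```python
from collections import Counter

def bag_of_words(reviews):
    """Turn (text, label) pairs into (count-vector, label) rows over a shared vocabulary."""
    vocab = sorted({w for text, _ in reviews for w in text.lower().split()})
    rows = []
    for text, label in reviews:
        counts = Counter(text.lower().split())
        rows.append(([counts[w] for w in vocab], label))  # Counter returns 0 for absent words
    return vocab, rows

reviews = [("great hotel great staff", "nonspam"),
           ("best deal buy now", "spam")]
vocab, rows = bag_of_words(reviews)
# vocab -> ['best', 'buy', 'deal', 'great', 'hotel', 'now', 'staff']
```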

Data Collection Data Sampling Randomly select examples to create an imbalanced test dataset with an imbalance ratio very close to that of the training set.

Data Collection Data Sampling Build five FIT datasets from the original FIT dataset by under-sampling the minority class (spam) while keeping all non-spam examples.

Data Collection Data Sampling For each of the FIT datasets, apply random under-sampling to the majority class to create a set of balanced datasets: 10 copies of randomly sampled datasets, each containing the same number of positive and negative examples.
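The two sampling stages above can be sketched together: first thin the minority class to set the imbalance level, then repeatedly under-sample the majority class to get balanced copies. A minimal sketch with placeholder index lists standing in for reviews; the function names and β = 0.5 value are illustrative.

```python
import random

def make_imbalanced(spam, nonspam, beta, rng):
    """Keep all non-spam; keep a beta fraction of spam to set the imbalance level."""
    k = max(1, int(len(spam) * beta))
    return rng.sample(spam, k), nonspam

def balanced_copies(spam, nonspam, m, rng):
    """m random under-samples of the majority class, each paired with all spam."""
    return [rng.sample(nonspam, len(spam)) + spam for _ in range(m)]

rng = random.Random(1)
spam = list(range(40))            # minority-class indices (placeholder data)
nonspam = list(range(100, 500))   # majority-class indices (placeholder data)
spam_b, nonspam_b = make_imbalanced(spam, nonspam, beta=0.5, rng=rng)
copies = balanced_copies(spam_b, nonspam_b, m=10, rng=rng)
```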

Results Rather than examining accuracy alone, other measures such as precision, recall, sensitivity, and specificity provide a more accurate evaluation of algorithm performance on the minority class. We compare the performance of our model against a decision-tree-based classifier (C4.5) using different statistical measures.

Results The class of interest here is the spam class, which is the positive class, so our goal is to increase the true positive rate and decrease the false positive rate.

Results (result tables and figures not reproduced in this transcript)

Conclusion In this research we addressed the problem of detecting spam in online reviews from imbalanced data distributions. We proposed a new classification technique to overcome the imbalanced-data problem for review spam detection, using random under-sampling to generate balanced training sets.

Conclusion The experiments show that the proposed method, iSRD, significantly outperforms the baseline C4.5 classifier in terms of TNR, FNR, sensitivity, AUC, and PRC, which are common measures for imbalanced-data evaluation.

Reference Al Najada, H., & Zhu, X. (2014, August). iSRD: Spam review detection with imbalanced data distributions. In Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IRI) (pp. 553–560). IEEE.