Opinion Spam and Analysis Nitin Jindal and Bing Liu Department of Computer Science University of Illinois at Chicago

Motivation
Opinions from reviews
– Used by both consumers and manufacturers
– Significant impact on product sales
Existing work
– Focuses on extracting and summarizing opinions from reviews
– Little knowledge about the characteristics of reviews and the behavior of reviewers
– No study on the trustworthiness of opinions
– No quality control → spam reviews

Review Spam
Increasing mention in the blogosphere
Articles in leading news media
– CNN, NY Times
Increasing number of customers wary of fake reviews (biased reviews, paid reviews), as reported by the leading PR firm Burson-Marsteller
Fake/untruthful reviews aim to promote or damage a product's reputation
Different from assessing the usefulness of reviews

Different from Other Spam Types
Web spam (link spam, content spam)
– Reviews contain few links, and adding irrelevant words is of little help
Email spam (unsolicited commercial advertisements)
– In reviews, advertisements are not as frequent as in emails and are relatively easy to detect

Overview
Opinion data and analysis
– Reviews, reviewers and products
– Feedbacks, ratings
Review spam
– Categorization of review spam
– Analysis and detection

Amazon Data (June 2006)
– 5.8M reviews, 1.2M products and 2.1M reviewers
– A review has 8 parts (one possible record layout is sketched below)
Industry-manufactured products, "mProducts", e.g. electronics, computers, accessories, etc.
– 228K reviews, 36K products and 165K reviewers
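
The slides do not enumerate the 8 parts of a review. A minimal sketch of what such a record could look like, assuming typical Amazon review metadata (the field names are illustrative assumptions, not the paper's exact schema):

```python
from dataclasses import dataclass

@dataclass
class Review:
    # Illustrative fields only; assumed, not taken from the slides.
    product_id: str
    reviewer_id: str
    rating: int             # 1-5 stars
    date: str
    title: str
    body: str
    helpful_feedbacks: int  # feedbacks marking the review as helpful
    total_feedbacks: int    # total feedbacks received
```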

Reviews, Reviewers and Products
[Log-log plots – Fig. 1: reviews vs. reviewers; Fig. 2: reviews vs. products; Fig. 3: reviews vs. feedbacks]

Observations
Reviews & reviewers
– 68% of reviewers wrote only one review
– Only 8% of reviewers wrote at least 5 reviews
Reviews & products
– 50% of products have only one review
– Only 19% of products have at least 5 reviews
Reviews & feedbacks
– Closely follow a power law (a quick check is sketched below)
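
The power-law claim can be sanity-checked by fitting a line to rank/frequency data on a log-log scale; a minimal sketch with made-up counts (not the paper's data):

```python
import numpy as np

# Hypothetical counts: number of reviews per entity (reviewer, product, ...),
# sorted in decreasing order. A power law is a straight line on a log-log
# plot, and the slope of the fit estimates its exponent.
counts = np.array([120, 61, 40, 30, 24, 20, 17, 15, 13, 12])
ranks = np.arange(1, len(counts) + 1)

slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"estimated power-law exponent: {slope:.2f}")
```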

Review Ratings
Rating of 5
– 60% of reviews
– 45% of products
– 59% of members
Reviews and feedbacks
– 1st review: 80% positive feedbacks
– 10th review: 70% positive feedbacks

Duplicate Reviews
Two reviews that have very similar content are called duplicates (a simple similarity check is sketched below)
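
A common way to detect near-duplicate text is shingling plus Jaccard similarity; a minimal sketch, assuming word-bigram shingles and a 90% similarity threshold (both parameters are assumptions, not stated on the slides):

```python
def shingles(text, k=2):
    """Set of k-word shingles (word k-grams) from a review body."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_duplicate(review1, review2, threshold=0.9):
    # The 0.9 threshold is an assumption for illustration.
    return jaccard(shingles(review1), shingles(review2)) >= threshold

print(is_duplicate("Great phone, amazing battery life and screen",
                   "Great phone, amazing battery life and display"))
```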

Members Who Duplicated Reviews
– 10% of reviewers with more than one review (~650K) wrote duplicate reviews
– 40% of the time the duplicates are exact copies

Types of Duplicate Reviews
1. Same userid, same product
2. Different userid, same product
3. Same userid, different product
4. Different userid, different product

Categorization of Review Spam
Type 1 (Untruthful Opinions)
– Ex:
Type 2 (Reviews on Brands Only)
– Ex: "I don't trust HP and never bought anything from them"
Type 3 (Non-reviews)
– Advertisements
  Ex: "Detailed product specs: g, IMR compliant, …", "…buy this product at: compuplus.com"
– Other non-reviews
  Ex: "What port is it for", "The other review is too funny", "Go Eagles go"

Spam Detection
Type 2 and type 3 spam reviews
– Supervised learning
Type 1 spam reviews
– Manual labeling is very difficult
– Propose to use duplicate and near-duplicate reviews

Detecting Type 2 & Type 3 Spam Reviews
Binary classification
– Logistic regression (a minimal example is sketched below)
  Gives probabilistic estimates
  Useful in practical applications, e.g. to weight each review, rank reviews, etc.
– Other models (naïve Bayes, SVM and decision trees) performed poorly
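
A minimal sketch of this setup with scikit-learn (the feature values and labels below are placeholders, not the paper's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder feature matrix: one row per review, columns stand in for
# features like those on the next slide (rating deviation, feedback %, length).
X = np.array([[0.1, 0.9, 120],
              [0.8, 0.2, 15],
              [0.2, 0.7, 300],
              [0.9, 0.1, 10]], dtype=float)
y = np.array([0, 1, 0, 1])  # 1 = spam (type 2/3), 0 = non-spam

model = LogisticRegression().fit(X, y)

# predict_proba gives P(spam | review): a score to weight or rank reviews.
print(model.predict_proba(X)[:, 1])
```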

Feature Construction
Three types
– Review-centric, reviewer-centric and product-centric
32 features in total
– Rating-related features: average rating, standard deviation, etc.
– Feedback-related features: percentage of positive feedbacks, total feedbacks, etc.
– Textual features: opinion words [Hu, Liu '04], numerals, capitals, cosine similarity, etc. (a few are sketched below)
– Other features: length and position of the review; sales rank, price, etc.
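
A minimal sketch of a few textual features (the exact definitions are illustrative assumptions, not the paper's):

```python
# Tiny illustrative lexicon; the slides cite the opinion words of [Hu, Liu '04].
POSITIVE_OPINION_WORDS = {"great", "good", "love", "excellent"}

def textual_features(body):
    """A few illustrative review-centric textual features."""
    words = body.lower().split()
    n = max(len(words), 1)
    return {
        "pct_opinion_words": sum(w.strip(".,!?") in POSITIVE_OPINION_WORDS
                                 for w in words) / n,
        "pct_numerals": sum(w.isdigit() for w in words) / n,
        "pct_all_caps": sum(w.isupper() for w in body.split()) / n,
        "length": len(words),
    }

print(textual_features("GREAT product, I love it. Bought 2 of them!"))
```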

Experimental Results
Evaluation criteria (sketched below)
– Area Under Curve (AUC)
– 10-fold cross-validation
High AUC → type 2 and type 3 spam are easy to detect, and detected equally well
– Text features alone are not sufficient
– Feedback features are unhelpful (feedback can itself be spammed)
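
A minimal sketch of the evaluation protocol, assuming scikit-learn and placeholder data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # placeholder features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # placeholder labels

# 10-fold cross-validated AUC, matching the slide's evaluation criteria.
auc = cross_val_score(LogisticRegression(), X, y, cv=10, scoring="roc_auc")
print(f"mean AUC: {auc.mean():.3f}")
```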

Type 1 Spam Reviews
– Hype spam: promote one's own product
– Defaming spam: defame a competitor's product
– Very hard to detect manually
[Figure: harmful regions]

Predictive Power of Duplicates
– Representative of all kinds of spam; only 3% of duplicates are accidental
– Train with duplicates as positive examples and the rest of the reviews as negative examples
– Good predictive power
– How to check whether it can detect type 1 reviews? (outlier reviews)

Outlier Reviews
– Reviews whose ratings deviate from the average product rating
– A necessary (but not sufficient) condition for harmful spam reviews
Predicting outlier reviews
– Run the logistic regression model trained on duplicate reviews (without rating-related features)
– Lift curve analysis (sketched below)
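
A minimal sketch of a cumulative lift analysis under these assumptions: score reviews with the model, sort by score, and measure what fraction of all outlier reviews is captured in each top decile (details are assumptions, not the paper's exact procedure):

```python
import numpy as np

def cumulative_lift(scores, is_outlier, buckets=10):
    """Fraction of all outlier reviews captured within each top decile
    of model score (scores sorted from highest to lowest)."""
    order = np.argsort(-np.asarray(scores))
    flags = np.asarray(is_outlier)[order]
    per_bucket = np.array_split(flags, buckets)
    return np.cumsum([b.sum() for b in per_bucket]) / flags.sum()

rng = np.random.default_rng(1)
scores = rng.random(1000)                      # placeholder spam scores
is_outlier = rng.random(1000) < scores * 0.3   # placeholder outlier flags
print(cumulative_lift(scores, is_outlier))     # a useful model rises fast
```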

Lift Curve for Outlier Reviews
Biased reviewer → all good or all bad reviews on the products of a brand
Negative-deviation reviews are more likely to be spam
– Biased reviews are the most likely
Positive-deviation reviews are the least likely to be spam, except
– average reviews on bad products
– biased reviewers

Other Interesting Outlier Reviews
– "Only" reviews (the sole review of a product)
– Reviews from top-ranked members
– Reviews with different levels of feedback
– Reviews on products with different sales ranks
"If the model is able to predict outlier reviews, then with some degree of confidence we can say that it will predict harmful spam reviews too"

Only Reviews
– 46% of reviewed products have only one review
– "Only" reviews have a high lift curve

Reviews from Top-Ranked Reviewers
Reviews by top-ranked reviewers are assigned higher probabilities of being spam
– Top-ranked members write a larger number of reviews
– They deviate a lot from product ratings and write many "only" reviews

Reviews with Different Levels of Feedback
Randomly distributed
– Spam reviews can also get good feedback

Reviews of Products with Varied Sales Ranks
Product sales rank is an important feature
– High sales rank → low levels of spam
– Spam activity is linked to low-selling products

Conclusions
Review spam and its detection
– Categorization into three types
– Types 2 and 3 are easy to detect
– Type 1 is difficult to label manually
– Proposed using duplicate reviews to detect type 1 spam
– Showed predictive power on outlier reviews
– Analyzed other interesting outlier reviews
Questions?