DETECTING SPAMMERS AND CONTENT PROMOTERS IN ONLINE VIDEO SOCIAL NETWORKS Fabrício Benevenuto, Tiago Rodrigues, Virgílio Almeida, Jussara Almeida, and Marcos Gonçalves Computer Science Department, Federal University of Minas Gerais, Belo Horizonte, Brazil (SIGIR '09) Speaker: Yi-Ling Tai Date: 2009/09/28

OUTLINE
Introduction
User test collection
Analyzing user behavior attributes
Detecting spammers and promoters
Evaluation metrics
Experimental setup
Classification
Reducing the attribute set
Conclusions

INTRODUCTION YouTube is the most popular online video social network. It allows users to post a video as a response to a discussion topic. These features open opportunities for users to introduce polluted content into the system. Pollution is used to spread advertisements to generate sales, disseminate pornography, or compromise the system's reputation.

INTRODUCTION Users cannot easily identify pollution before watching it, and it also consumes system resources, especially bandwidth. This paper addresses the issue of detecting video spammers and promoters. Spammers post an unrelated video as a response to a popular video topic to increase the likelihood of the response being viewed. Promoters post a large number of responses to boost the rank of the video topic.

INTRODUCTION Toward this end, the authors: crawl a large user data set from YouTube; manually classify users as legitimate users, spammers, or promoters; study attributes that distinguish the different types of polluters; and use a supervised classification algorithm to detect spammers and promoters.

USER TEST COLLECTION A YouTube video is a responded video or a video topic if it has at least one video response. A YouTube user is a responsive user if she has posted at least one video response. A responded user is someone who posted at least one responded video. Polluter is used to refer to either a spammer or a promoter.

CRAWLING YOUTUBE User interactions can be represented by a video response user graph G = (X, Y), where X is the set of all users who posted or received video responses, and (x1, x2) is a directed arc in Y if user x1 has responded to a video contributed by user x2. To build the graph, the authors implemented a crawler following Algorithm 1.
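Algorithm 1 itself is not reproduced in this transcript; as a rough illustration of the idea only, the sketch below performs a breadth-first crawl of the response graph, assuming a hypothetical helper get_responders(user) that returns the users who posted video responses to that user's videos.

```python
from collections import deque

def crawl_response_graph(seed_users, get_responders, max_users=300000):
    """Breadth-first crawl of the video response user graph G = (X, Y).

    get_responders(user) is an assumed helper (not part of the paper) that
    returns the users who posted video responses to videos owned by `user`.
    Returns the visited user set X and the arc set Y, where an arc
    (x1, x2) means x1 responded to a video contributed by x2.
    """
    visited = set(seed_users)
    queue = deque(seed_users)
    arcs = set()

    while queue and len(visited) < max_users:
        x2 = queue.popleft()
        for x1 in get_responders(x2):
            arcs.add((x1, x2))        # x1 responded to a video of x2
            if x1 not in visited:
                visited.add(x1)
                queue.append(x1)
    return visited, arcs
```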

CRAWLING YOUTUBE The sampling starts from a set of 88 seeds, consisting of the owners of the top-100 most responded videos of all time. The crawler follows links, gathering information on a number of different attributes. The crawl collected 264,460 users, 381,616 responded videos, and 701,950 video responses.

CRAWLING YOUTUBE

BUILDING A TEST COLLECTION The main goal is to study the patterns and characteristics of each class of users. The collection should have the following properties: a significant number of users from all three categories; a large amount of pollution, but not restricted to it; and a large number of legitimate users with different behaviors. Random sampling alone may not achieve these properties.

BUILDING A TEST COLLECTION Three strategies for user selection were combined. (1) Different levels of interaction: users were divided into four groups based on their in- and out-degrees, and 100 users were randomly selected from each group. (2) Aiming at a test collection containing polluters: the responses of the top 100 most responded videos were browsed, and suspect users were selected. (3) 300 users who posted video responses to the top 100 most responded videos were randomly selected, to minimize a possible bias introduced by strategy 2.

BUILDING A TEST COLLECTION Each selected user was then manually classified. Three volunteers analyzed all video responses of each user to classify her into one of the three categories. Volunteers were instructed to favor legitimate users.

ANALYZING USER BEHAVIOR ATTRIBUTES Three attribute sets were considered. Video attributes: duration, number of views, comments received, rating, number of times the video was selected as a favorite, number of honors, and number of external links. These are computed over three video groups per user: all videos uploaded by the user, her video responses, and the responded videos she responded to, summing up to 42 video attributes for each user.

ANALYZING USER BEHAVIOR ATTRIBUTES User attributes: number of friends, number of videos uploaded, number of videos watched, number of videos added as favorites, numbers of video responses posted and received, numbers of subscriptions and subscribers, average time between video uploads, and maximum number of videos uploaded in 24 hours.

ANALYZING USER BEHAVIOR ATTRIBUTES Social network attributes: clustering coefficient cc(i), the ratio of the number of existing edges between i's neighbors to the maximum possible number; betweenness; reciprocity; assortativity, the ratio between the node (in/out) degree and the average (in/out) degree of its neighbors; and UserRank.
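To make the first of these metrics concrete, here is a minimal sketch (not the paper's implementation) of the clustering coefficient over an undirected adjacency structure:

```python
def clustering_coefficient(adj, i):
    """cc(i): existing edges among i's neighbors / maximum possible number.

    adj maps each node to the set of its neighbors (undirected view).
    """
    neighbors = adj.get(i, set())
    k = len(neighbors)
    if k < 2:
        return 0.0
    existing = sum(1 for u in neighbors for v in neighbors
                   if u < v and v in adj.get(u, set()))
    return existing / (k * (k - 1) / 2)

# Example: node 3 has neighbors {1, 2, 4}; only the edge (1, 2) exists among them.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(clustering_coefficient(adj, 3))   # 0.333...
```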

ANALYZING USER BEHAVIOR ATTRIBUTES Two well-known feature selection methods were used to rank the attributes: information gain and chi-squared (χ2).
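A hedged sketch of how such a ranking could be computed with scikit-learn (an assumed stand-in for whatever tool the authors actually used; mutual_info_classif serves as a proxy for information gain, and the data below is a placeholder):

```python
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

# X: one row per user, one column per behavior attribute (chi2 needs non-negative values).
# y: class labels, e.g. 0 = legitimate, 1 = spammer, 2 = promoter.  Placeholder data here.
rng = np.random.default_rng(0)
X = rng.random((500, 60))
y = rng.integers(0, 3, size=500)

chi2_scores, _ = chi2(X, y)                    # chi-squared statistic per attribute
info_gain = mutual_info_classif(X, y)          # mutual information, close in spirit to information gain

ranking = np.argsort(chi2_scores)[::-1]        # attribute indices, most discriminative first
print("top 10 attributes by chi2:", ranking[:10])
print("top 10 by information gain:", np.argsort(info_gain)[::-1][:10])
```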

ANALYZING USER BEHAVIOR ATTRIBUTES

EVALUATION METRICS The standard information retrieval metrics are used: recall, precision, Micro-F1, and Macro-F1. Micro-F1: first compute global precision and recall values over all classes, then calculate F1. Macro-F1: first calculate F1 values for each class in isolation, then average over all classes.
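A small illustrative example of the two averaging schemes, using scikit-learn and made-up labels (L = legitimate, S = spammer, P = promoter):

```python
from sklearn.metrics import f1_score

y_true = ["L", "L", "L", "L", "S", "S", "S", "P", "P", "P"]
y_pred = ["L", "L", "L", "S", "S", "L", "S", "P", "P", "L"]

# Micro-F1: pool true/false positives and negatives over all classes first, then compute F1.
micro = f1_score(y_true, y_pred, average="micro")

# Macro-F1: compute F1 per class in isolation, then average the per-class values.
macro = f1_score(y_true, y_pred, average="macro")

print(f"Micro-F1 = {micro:.3f}, Macro-F1 = {macro:.3f}")
```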

EVALUATION METRICS confusion matrix

EXPERIMENTAL SETUP libSVM, an open source SVM package, was used; it allows searching for the best classifier parameters using the training data and provides a series of optimizations, including normalization of all numerical attributes. Experiments use 5-fold cross-validation, repeated 5 times with different seeds used to shuffle the original data set, producing 25 different results for each test.
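A minimal sketch of this protocol; the paper used libSVM with its own parameter search, so scikit-learn's SVC and the placeholder data below are only stand-ins for the repeated, reshuffled 5-fold procedure:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder attribute matrix and labels (0 = legitimate, 1 = spammer, 2 = promoter).
rng = np.random.default_rng(0)
X = rng.random((800, 60))
y = rng.integers(0, 3, size=800)

# Normalization plus an RBF SVM, mirroring the libSVM setup described on the slide.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

scores = []
for seed in range(5):                                   # 5 repetitions with different shuffling seeds
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores.extend(cross_val_score(clf, X, y, cv=cv, scoring="f1_micro"))

print(f"{len(scores)} results (5 folds x 5 repetitions), mean Micro-F1 = {np.mean(scores):.3f}")
```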

TWO CLASSIFICATION STRATEGIES Flat classification: promoters (P), spammers (S), and legitimate users (L). Hierarchical strategy: first separate promoters (P) from non-promoters (NP); then split promoters into heavy (HP) and light (LP) promoters, and non-promoters into legitimate users (L) and spammers (S).

FLAT CLASSIFICATION In the confusion matrix obtained, the numbers presented are percentages relative to the total number of users in each class, and the diagonal indicates the recall of each class. No promoter was classified as a legitimate user. 3.87% of promoters were classified as spammers: their videos actually acquired popularity, making them harder to distinguish from spammers.

FLAT CLASSIFICATION 41.91% of spammers were misclassified as legitimate users: legitimate users also post their video responses to popular responded videos (a typical behavior of spammers), which blurs the boundary between the two classes. Micro-F1 = 87.5, with per-class F1 values of 90.8, 63.7, and 92.3; Macro-F1 = 82.2.

HIERARCHICAL CLASSIFICATION Binary classification. The J parameter allows giving priority to one class (e.g., spammers) over the other (e.g., legitimate users). Promoters vs. non-promoters: Macro-F1 = …, Micro-F1 = 99.17.
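The J parameter corresponds to libSVM's per-class cost weighting; a hedged analogue uses scikit-learn's class_weight (placeholder data, intended only to illustrate the trade-off, not to reproduce the paper's numbers):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 10))                     # placeholder attributes
y = rng.integers(0, 2, size=200)              # 0 = legitimate, 1 = spammer

# Raising the weight of the spammer class plays the role of a larger J:
# more spammers are caught, at the cost of flagging more legitimate users by mistake.
for j in (1, 3, 10):
    clf = SVC(kernel="rbf", class_weight={0: 1, 1: j}).fit(X, y)
    flagged = clf.predict(X).mean()           # fraction of users flagged as spammers
    print(f"J ~ {j}: {flagged:.1%} of users flagged as spammers")
```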

NON-PROMOTERS We trained the classifier with the original training data without promoters, with J = 1, J = … (…% vs. 1%), and J = … (…% vs. 9%). The best solution depends on the system administrator's objectives.

HEAVY AND LIGHT PROMOTERS Aggressiveness: the maximum number of video responses posted in a 24-hour period. The k-means clustering algorithm was used to separate promoters into two clusters by aggressiveness. Average aggressiveness: light promoters = … (CV = 0.63), heavy promoters = … (CV = 0.61).
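A minimal sketch of this split with scikit-learn's k-means (the aggressiveness values below are placeholders, not the paper's data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Aggressiveness: maximum number of video responses a promoter posted in any 24-hour window.
aggressiveness = np.array([3, 5, 8, 12, 7, 9, 90, 110, 150, 95, 130], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(aggressiveness.reshape(-1, 1))

for c in range(2):
    group = aggressiveness[labels == c]
    cv = group.std() / group.mean()            # coefficient of variation
    print(f"cluster {c}: mean aggressiveness = {group.mean():.1f}, CV = {cv:.2f}")
```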

HEAVY AND LIGHT PROMOTERS Binary classification retrained with the original training data containing only promoters

REDUCING THE ATTRIBUTE SET Two scenarios were evaluated: (1) considering attributes in decreasing order of position in the χ2 ranking; (2) evaluating classification using subsets of 10 attributes occupying contiguous positions in the ranking.
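A self-contained sketch of the second scenario (placeholder data; the chi-squared ranking and classifier mirror the earlier sketches rather than the paper's exact setup):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import chi2
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((800, 60))                      # placeholder attributes
y = rng.integers(0, 3, size=800)               # placeholder labels

ranking = np.argsort(chi2(X, y)[0])[::-1]      # chi2 ranking, best attribute first
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

window = 10
for start in range(0, len(ranking) - window + 1, window):
    cols = ranking[start:start + window]       # 10 attributes in contiguous ranking positions
    score = cross_val_score(clf, X[:, cols], y, cv=5, scoring="f1_micro").mean()
    print(f"ranking positions {start + 1}-{start + window}: Micro-F1 = {score:.3f}")
```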

CONCLUSIONS The paper presents an effective solution to detect spammers and promoters in online video social networks. The flat classification approach provides an alternative to simply considering all users as legitimate. The hierarchical approach explores different classification trade-offs and provides more flexibility for the application. Finally, significant benefits can be obtained with only a small subset of less expensive attributes. Since spammers and promoters will evolve and adapt to anti-pollution strategies, periodic reassessment of the classification process may be necessary.