Spam ain’t as Diverse as It Seems: Throttling OSN Spam with Templates Underneath Hongyu Gao, Yi Yang, Kai Bu, Yan Chen, Doug Downey, Kathy Lee, Alok Choudhary.

Presentation transcript:

Spam ain’t as Diverse as It Seems: Throttling OSN Spam with Templates Underneath Hongyu Gao, Yi Yang, Kai Bu, Yan Chen, Doug Downey, Kathy Lee, Alok Choudhary Northwestern University, USA Zhejiang University, China

Background
Among the world’s most visited websites by Alexa
– … billion monthly active users by Jul …
– … million users by Oct …
– … million users by Nov 2014

3 Background

4 Background
Scary Twitter spam stats
– … billion tweets posted to Twitter every day are spam
– … percent of Twitter’s user base is bots and spam bots

5 Our Prior OSN Security Work
First study of offline detection and characterization of social spam campaigns (SIGCOMM IMC 2010)
– Largest-scale experiment on Facebook at the time: 3.5M user profiles, 187M wall posts
– Confirmed spam campaigns in the wild: 200K spam wall posts in 19 significant campaigns
– Featured in the Wall Street Journal, MIT Technology Review, and ACM Tech News
Online spam campaign discovery (NDSS 2012)
– Mostly uses non-semantic information and syntactic clustering

6 How Are the Spam Tweets Generated?
Measuring the trend of Twitter spam
– Download tweets containing popular hashtags
– Visit Twitter retrospectively to identify suspended accounts
2011 Twitter data:
– 17 million tweets
– 558,706 spam tweets (>3%)

7 Template Model
A macro sequence $(m_1, m_2, \ldots, m_k)$
Each macro instantiates differently during spam generation
Macro 1              Macro 2                             Macro 3
Beppe Signori        making out with another man -       URL
Jason Isaacs         making out with another man -       URL
Beppe Signori        is really gay, look at this video   URL
Jason Isaacs         is really gay, look at this video   URL
RIP Jonas Bevacqua   is really gay, look at this video   URL
Template = celebrity names + actions + URL
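The macro-sequence model can be illustrated with a short sketch. The macro values below are hypothetical stand-ins with the same shape as the slide's example (names + actions + URL), and `instantiate` is an illustrative helper, not code from the talk. It shows how a handful of macro values multiplies into many distinct-looking messages.

```python
from itertools import product

# Hypothetical macros: each macro is a list of alternative instantiations.
MACROS = [
    ["Ann Lee", "Bob Ray", "Sir Tom Fox"],  # macro 1: names
    ["won big -", "is on video"],           # macro 2: actions
    ["URL"],                                # macro 3: link placeholder
]

def instantiate(macros):
    """Yield every message obtainable by picking one value per macro."""
    for choice in product(*macros):
        yield " ".join(choice)

messages = list(instantiate(MACROS))
print(len(messages))  # 3 * 2 * 1 = 6 distinct messages
```

Every generated message differs from the others as a string, yet all of them share the same three-macro structure, which is the property the template model relies on.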

8 Semi-automated Spam Measurement
The majority of spam is generated with underlying templates
We collect a smaller 2012 Twitter dataset containing 46,891 spam tweets
The prevalence of template-based spam is persistent
Spam data   With Template   Paraphrase   No-content   Others
…           …%              14.7%        8.4%         13.9%
…           …%              12.9%        0.3%         18.5%
Syntactic-only detection is not sufficient!

9 Semantics Based Spam Detection
Extract spam templates in real time
Fight spam with its own template
Detect multiple spam templates simultaneously

10 Challenges
Absence of invariant substring in template
– Prior studies assume the existence of invariant substrings. [Pitsillidis NDSS’10][Zhang NDSS’14]
Prevalence of noise
– Spammers extensively add semantically unrelated noise words into spam messages.
Spam heterogeneity
– In practice, it is hard to obtain a training set containing spam that instantiates a single template.

11 Solutions
Absence of invariant substring in template
– Spam template generation without the need for an invariant substring.
Prevalence of noise
– Automated noise labeling to identify and exclude noise words from template generation.
Spam heterogeneity
– Cluster and refine.

12 Template Generation/Matching Module
Real-time detection
The auxiliary spam filter supplies training spam samples
– Could use a blacklist or any other spam detection system
– Heterogeneous filters to avoid evasion
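One way to picture how these pieces fit together is the toy wiring below. It is only a sketch under stated assumptions: the class `TemplatePipeline`, the `batch_size` parameter, the blacklist domain, and the shared-prefix "template generator" are all hypothetical stand-ins, since the slide does not specify the interfaces. The flow is: try the known templates first, fall back to the auxiliary filter, and turn whatever the auxiliary filter catches into training samples for new templates.

```python
class TemplatePipeline:
    """Toy wiring of an auxiliary filter, a template generator, and a matcher."""

    def __init__(self, auxiliary_filter, generate_template, batch_size=3):
        self.auxiliary_filter = auxiliary_filter    # callable: message -> bool
        self.generate_template = generate_template  # callable: samples -> matcher
        self.batch_size = batch_size                # hypothetical retraining trigger
        self.templates = []
        self.training_pool = []

    def is_spam(self, message):
        # 1. Fast path: does any known template match?
        if any(template(message) for template in self.templates):
            return True
        # 2. Otherwise ask the auxiliary filter; its hits become training samples.
        if self.auxiliary_filter(message):
            self.training_pool.append(message)
            if len(self.training_pool) >= self.batch_size:
                self.templates.append(self.generate_template(self.training_pool))
                self.training_pool = []
            return True
        return False

def common_prefix_template(samples):
    """Stand-in generator: the 'template' is the leading words all samples share."""
    prefix = []
    for column in zip(*(s.split() for s in samples)):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return lambda message: bool(prefix) and message.split()[:len(prefix)] == prefix

def blacklist_filter(message):
    """Stand-in auxiliary filter: flags one hypothetical blacklisted domain."""
    return "bad.example" in message

pipeline = TemplatePipeline(blacklist_filter, common_prefix_template)
for text in [
    "Wow look at this bad.example/1",
    "Wow look at this bad.example/2 now",
    "Wow look at this bad.example/3",
]:
    pipeline.is_spam(text)

caught = pipeline.is_spam("Wow look at this new.example/9")  # not blacklisted
ignored = pipeline.is_spam("Lunch at noon?")
print(caught, ignored)
```

In this toy run the fourth message uses a domain the blacklist has never seen, yet it is caught because it instantiates the template learned from the first three samples.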

13 Single Campaign Template Generation
Step 1: Compute a “good” common super-sequence (Majority-Merge algorithm)
Messages:
Beppe Signori making out – URL
Jason Isaacs making out – URL
Beppe Signori is really gay URL
Jason Isaacs is really gay URL
RIP Jonas Bevacqua is really gay URL
Super-sequence (first row) and the messages aligned to it:
Beppe Signori Jason Isaacs making out is really gay - url RIP Jonas Bevacqua is really gay url
Beppe Signori ε     ε      making out ε  ε      ε   - url ε   ε     ε        ε  ε      ε   ε
ε     ε       Jason Isaacs making out ε  ε      ε   - url ε   ε     ε        ε  ε      ε   ε
Beppe Signori ε     ε      ε      ε   is really gay ε ε   ε   ε     ε        ε  ε      ε   url
ε     ε       Jason Isaacs ε      ε   is really gay ε ε   ε   ε     ε        ε  ε      ε   url
ε     ε       ε     ε      ε      ε   ε  ε      ε   ε ε   RIP Jonas Bevacqua is really gay url
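Step 1 can be sketched with the textbook Majority-Merge heuristic: repeatedly emit the token that heads the most remaining messages and consume it from those messages. The code below is a minimal sketch, not the authors' implementation; the messages are hypothetical stand-ins shaped like the slide's example, and the tie-breaking rule (first token seen wins) is an assumption, so the resulting super-sequence need not have the same column order as the one on the slide.

```python
EPSILON = "ε"

def majority_merge(sequences):
    """Greedy common super-sequence: emit the most frequent head token each round."""
    remaining = [list(seq) for seq in sequences]
    supersequence = []
    while any(remaining):
        counts = {}
        for seq in remaining:
            if seq:
                counts[seq[0]] = counts.get(seq[0], 0) + 1
        # max() keeps the first token seen among equally frequent heads.
        token = max(counts, key=counts.get)
        supersequence.append(token)
        for seq in remaining:
            if seq and seq[0] == token:
                seq.pop(0)
    return supersequence

def align(sequence, supersequence):
    """Place one message under the super-sequence, padding gaps with EPSILON."""
    row, i = [], 0
    for token in supersequence:
        if i < len(sequence) and sequence[i] == token:
            row.append(token)
            i += 1
        else:
            row.append(EPSILON)
    if i != len(sequence):
        raise ValueError("super-sequence does not cover this message")
    return row

messages = [
    "Ann Lee won big - url",
    "Bob Ray won big - url",
    "Ann Lee is on video url",
    "Bob Ray is on video url",
    "Sir Tom Fox is on video url",
]
tokens = [m.split() for m in messages]
superseq = majority_merge(tokens)
matrix = [align(t, superseq) for t in tokens]

print(" ".join(superseq))
for row in matrix:
    print(" ".join(row))
```

Each printed row has one cell per super-sequence column, which is the matrix form that the following steps reduce and concatenate.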

14 Single Campaign Template Generation
Step 2: Matrix columns reduction
Super-sequence (first row) and the aligned messages after reduction:
Beppe Signori Jason Isaacs making out - RIP Jonas Bevacqua is really gay url
Beppe Signori ε     ε      making out - ε   ε     ε        ε  ε      ε   url
ε     ε       Jason Isaacs making out - ε   ε     ε        ε  ε      ε   url
Beppe Signori ε     ε      ε      ε   ε ε   ε     ε        is really gay url
ε     ε       Jason Isaacs ε      ε   ε ε   ε     ε        is really gay url
ε     ε       ε     ε      ε      ε   ε RIP Jonas Bevacqua is really gay url
Per-column alternatives: (Beppe|ε) (Signori|ε) (Jason|ε) (Isaacs|ε) …

15 Single Campaign Template Generation
Step 3: Matrix columns concatenation
Regular Expression Template:
Beppe Signori|Jason Isaacs|RIP Jonas Bevacqua   is really gay|making out -   url
Matrix rows:
Beppe Signori        making out -    url
Jason Isaacs         making out -    url
Beppe Signori        is really gay   url
Jason Isaacs         is really gay   url
RIP Jonas Bevacqua   is really gay   url
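The slide does not spell out the concatenation rule, so the following is one plausible reading rather than the authors' procedure: adjacent columns filled by exactly the same messages are joined into a phrase, and phrases used by disjoint sets of messages become alternatives of one macro. The header and messages are hypothetical stand-ins with the same shape as the slide's example, and all function names are illustrative.

```python
EPSILON = "ε"

# Hypothetical reduced super-sequence and messages.
HEADER = "Ann Lee Bob Ray won big - Sir Tom Fox is on video url".split()
MESSAGES = [
    "Ann Lee won big - url",
    "Bob Ray won big - url",
    "Ann Lee is on video url",
    "Bob Ray is on video url",
    "Sir Tom Fox is on video url",
]

def occupancy(header, messages):
    """For each header column, the set of message indices that fill it."""
    used = [set() for _ in header]
    for row, message in enumerate(messages):
        col = 0
        for token in message.split():
            while header[col] != token:  # columns this message leaves as ε
                col += 1
            used[col].add(row)
            col += 1
    return used

def concatenate(header, used):
    """Join adjacent columns that are filled by exactly the same messages."""
    phrases = []
    for token, rows in zip(header, used):
        if phrases and phrases[-1][1] == rows:
            phrases[-1] = (phrases[-1][0] + " " + token, rows)
        else:
            phrases.append((token, set(rows)))
    return phrases

def group_macros(phrases):
    """Make phrases used by disjoint messages alternatives of one macro."""
    macros = []
    for phrase, rows in phrases:
        target = None
        for i, macro in enumerate(macros):
            later = set().union(*(m["rows"] for m in macros[i + 1:]))
            # Merge only if no message would end up with two values in this
            # macro, or with this phrase placed before a value it follows.
            if not (macro["rows"] & rows) and not (later & rows):
                target = macro
                break
        if target is None:
            target = {"alts": [], "rows": set()}
            macros.append(target)
        target["alts"].append(phrase)
        target["rows"] |= rows
    return macros

def render(macros, n_rows):
    """Write the macros as a template, adding ε where a macro is optional."""
    parts = []
    for macro in macros:
        alts = list(macro["alts"])
        if len(macro["rows"]) < n_rows:
            alts.append(EPSILON)
        parts.append("(" + "|".join(alts) + ")")
    return " ".join(parts)

phrases = concatenate(HEADER, occupancy(HEADER, MESSAGES))
template = render(group_macros(phrases), len(MESSAGES))
print(template)
```

On this input the result has three macros (names, actions, url), mirroring the three-column template shown on the slide.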

16 Solutions
Spam template generation without the need for an invariant substring.
Automated noise labeling to identify and exclude noise words from template generation.
Cluster and refine for a mixture of spam campaigns.

17 Noise Labeling
Key problem: spammers extensively insert noise words into spam messages
– To draw a larger audience
– To diversify the #hashtags, popular terms, etc.

18 Noise Labeling
Goal: exclude the noise words from the template generation process.
Method: treat noise detection as a sequence labeling task, using a Conditional Random Fields (CRF) approach.
Output: a “noise” or “non-noise” label for each word in the message.

19 Feature Selection
Intuition: noise words are individually popular, but combinations of them are not.
Features:
– $\mathrm{freq}(t_i)$
– $\mathrm{freq}(t_i t_{i+1})^2 / (\mathrm{freq}(t_i)\,\mathrm{freq}(t_{i+1}))$
– $\mathrm{freq}(t_{i-1} t_i)^2 / (\mathrm{freq}(t_{i-1})\,\mathrm{freq}(t_i))$
Orthographic features:
– Is capitalized?
– Is numeric?
– Is hashtag?
– Is user mention?
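The features listed on this slide can be computed per token roughly as follows. This is a sketch under assumptions: `freq` is taken to be a raw count over a tokenized corpus (the slide does not say whether counts or relative frequencies are used), the feature names are illustrative, and the tiny corpus is made up. The resulting dictionaries are the kind of per-token input a CRF sequence labeler would consume.

```python
from collections import Counter

def build_counts(corpus):
    """Count single tokens and adjacent token pairs over a tokenized corpus."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def cohesion(a, b, unigrams, bigrams):
    """freq(a b)^2 / (freq(a) * freq(b)); 0.0 if either token is unseen."""
    denominator = unigrams[a] * unigrams[b]
    return bigrams[(a, b)] ** 2 / denominator if denominator else 0.0

def token_features(tokens, i, unigrams, bigrams):
    """Feature dictionary for the token at position i."""
    t = tokens[i]
    has_next = i + 1 < len(tokens)
    return {
        "freq": unigrams[t],
        "next_cohesion": cohesion(t, tokens[i + 1], unigrams, bigrams) if has_next else 0.0,
        "prev_cohesion": cohesion(tokens[i - 1], t, unigrams, bigrams) if i > 0 else 0.0,
        "is_capitalized": t[:1].isupper(),
        "is_numeric": t.isdigit(),
        "is_hashtag": t.startswith("#"),
        "is_mention": t.startswith("@"),
    }

# Made-up corpus: a fixed phrase followed by varying trending hashtags.
corpus = [
    ["win", "a", "prize", "#news"],
    ["win", "a", "prize", "#sports"],
    ["#news", "#sports"],
]
unigrams, bigrams = build_counts(corpus)
phrase_word = token_features(corpus[0], 1, unigrams, bigrams)  # "a"
noise_word = token_features(corpus[0], 3, unigrams, bigrams)   # "#news"
print(phrase_word["prev_cohesion"], noise_word["prev_cohesion"])
```

Both tokens are equally frequent, but the word inside the fixed phrase has a high cohesion with its neighbor while the appended hashtag does not, which is the intuition stated on the slide.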

20 Solutions
Spam template generation without the need for an invariant substring.
Automated noise labeling to identify and exclude noise words from template generation.
Cluster and refine for a mixture of spam campaigns.

21 Multi-campaign Template Generation
Problem: in a realistic scenario, the system observes a mixture of spam instantiating multiple templates, rather than a single one.
Solution:
– Part 1: coarse pre-clustering, using a standard clustering technique.
– Part 2: refine the single-campaign template generation process by limiting the ratio of “ε” in the matrix, which prunes out “outlier” messages.
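Part 2 can be sketched as a filter over the aligned matrix. The slide only says that the ratio of “ε” is limited, so the details here are assumptions: the ratio is computed per row, the threshold value is hypothetical, and the toy matrix is made up. A message that belongs to a different campaign shares few columns with the others, so its row is mostly ε and is dropped.

```python
EPSILON = "ε"
MAX_EPSILON_RATIO = 0.7  # hypothetical threshold, not taken from the talk

def epsilon_ratio(row):
    """Fraction of cells in an aligned row that are empty (ε)."""
    return row.count(EPSILON) / len(row)

def prune_outliers(matrix, max_ratio=MAX_EPSILON_RATIO):
    """Split aligned rows into those kept for template generation and outliers."""
    kept, outliers = [], []
    for row in matrix:
        (kept if epsilon_ratio(row) <= max_ratio else outliers).append(row)
    return kept, outliers

# Toy aligned matrix: two messages from one campaign plus one unrelated message.
matrix = [
    ["Ann", "Lee", "ε", "ε", "won", "big", "url", "ε"],
    ["ε", "ε", "Bob", "Ray", "won", "big", "url", "ε"],
    ["ε", "ε", "ε", "ε", "ε", "ε", "ε", "hello"],
]
kept, outliers = prune_outliers(matrix)
print(len(kept), len(outliers))
```

After pruning, the super-sequence would be recomputed from the kept messages only, so the outlier no longer adds columns to the template.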

22 Recap: Template Generation/Matching Module
Real-time detection
The auxiliary spam filter supplies training spam samples

23 Evaluation Results
Dataset:
– 17M tweets generated between June 1, 2011 and July 21, 2011
– 558,706 spam tweets
Auxiliary spam filter:
– The online campaign discovery module (introduced later)
– 63.3% TP rate, 0.27% FP rate

24 Detection Accuracy
                 Module
Spam Category    Template Generation   Auxiliary Filter   Combined
Template-based   95.7%                 70.1%              98.4%
Paraphrase       51.0%                 51.4%              70.1%
No-content       73.8%                 67.0%              83.1%
Others           18.4%                 43.2%              44.7%
Overall TP       76.2%                 63.3%              85.4%
Overall FP       0.12%                 0.27%              0.33%
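The “Combined” column is consistent with flagging a message when either module flags it, although the slide does not state the combination rule, so treat that as an assumption. Under an OR rule, the combined TP rate is at least the larger of the two and at most their sum, and the same bound holds for the FP rate. The sketch below checks this on made-up predictions and on the overall figures from the table.

```python
def rates(predictions, labels):
    """Return (true-positive rate, false-positive rate) for boolean predictions."""
    positives = sum(labels)
    negatives = len(labels) - positives
    true_pos = sum(1 for p, y in zip(predictions, labels) if p and y)
    false_pos = sum(1 for p, y in zip(predictions, labels) if p and not y)
    return true_pos / positives, false_pos / negatives

def combine(first, second):
    """Flag a message when either detector flags it (assumed OR rule)."""
    return [a or b for a, b in zip(first, second)]

# Made-up data: four spam messages followed by four legitimate ones.
labels = [True, True, True, True, False, False, False, False]
template_hits = [True, True, False, False, False, False, False, False]
auxiliary_hits = [False, True, True, False, True, False, False, False]

combined_tp, combined_fp = rates(combine(template_hits, auxiliary_hits), labels)
print(combined_tp, combined_fp)  # 0.75 0.25

# The overall figures in the table respect the same bounds (values in percent).
assert max(76.2, 63.3) <= 85.4 <= 76.2 + 63.3
assert max(0.12, 0.27) <= 0.33 <= 0.12 + 0.27
```

The combined FP rate in the table (0.33%) is below the sum of the two individual rates (0.39%), which is what one would expect if some false positives are shared by both modules.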

25 Generated Template Example
Top 5 generated templates with the most matching spam:
Spam #   Template
11.1%    ^ (I wager|My my,) you (cannot|ε) (ε|defeat) this \. URL.* $
7.2%     ^ The (ε|folks|people) at my (ε|place|location) are groveling for this ! URL.* $
6.4%     ^ You (will not|won’t|ε) (ε|think|believe) this \. The (ε|best|greatest) (thing|factor|ε) (because|since) slice bread \. URL.* $
5.0%     ^ (Cool|Wow|Amazing), I (by no means|in no way) (found|noticed) (people|anyone) (do that|ε) (just before|prior to) \. URL.* $
4.1%     ^ You (will not|won’t|ε) (think|believe|ε) the (issues|points|things) they do on this (site|web page|web-site) \. URL.* $
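To show how such a template can be matched, the sketch below compiles the first template from this slide into a Python regular expression. The encoding is an assumption made for illustration: ε is represented by an empty string and turns the column into an optional group, tokens are separated by whitespace as printed on the slide, and links are normalized to a `URL` token beforehand. The helper names and test messages are hypothetical.

```python
import re

EPSILON = ""  # stands for the slide's ε (macro may be absent)
URL_PATTERN = re.compile(r"https?://\S+")

def compile_template(columns):
    """Turn a list of macro columns into a regex; ε makes a column optional."""
    pieces = []
    for alternatives in columns:
        literal = "|".join(re.escape(a) for a in alternatives if a != EPSILON)
        piece = "(?:" + literal + r")\s+"
        if EPSILON in alternatives:
            piece = "(?:" + piece + ")?"
        pieces.append(piece)
    return re.compile("^" + "".join(pieces) + ".*$")

def normalise(message):
    """Replace links with a URL token and end with one space for the last column."""
    return URL_PATTERN.sub("URL", message).strip() + " "

def matches(template, message):
    return template.match(normalise(message)) is not None

# First template on the slide: ^ (I wager|My my,) you (cannot|ε) (ε|defeat) this \. URL.* $
template = compile_template([
    ["I wager", "My my,"],
    ["you"],
    ["cannot", EPSILON],
    [EPSILON, "defeat"],
    ["this"],
    ["."],
    ["URL"],
])

print(matches(template, "I wager you cannot defeat this . http://example.com/x"))
print(matches(template, "My my, you defeat this . http://example.com/a #win"))
print(matches(template, "See you at lunch"))
```

The first two messages differ in wording and carry different links and trailing noise, yet both instantiate the same template; the third does not.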

26 Sensitivity for New Campaigns
Pick the top 5 campaigns
All campaigns achieve almost 100% detection rate with 0.15% of messages as training samples.
The system can react to newly emerged campaigns quickly.

27 Template Matching Speed
The median matching latency grows slowly with the number of templates and stays below 8 ms.
The largest latency is less than 80 ms, unnoticeable to users.

28 Conclusions
Tangram: the first system to extract multiple spam templates in real time without unique invariants.
– 63% of Twitter spam is generated by templates.
– Detects 95.7% of template-based spam.
– Overall TP rate of 85.4% and FP rate of 0.33%.
Applying text analytics in other security applications
– Measuring the Description-to-permission Fidelity in Android Applications, CCS 2014

29 Existing Work, cont’d
Spam template generation [Pitsillidis NDSS’10][Zhang NDSS’14]
– How to detect spam without invariant substrings?
Spammer account detection [Stringhini ACSAC’10][Yang RAID’11]
– How to detect spam in real time?
– How to detect spam originating from compromised accounts, e.g., in a worm propagation scenario?

30 Thank you! Questions?

31 Background
Filtering Twitter spam is uniquely challenging
– Twitter exposes developer APIs that make it easy to interact with the Twitter platform
– Real-time content is fundamental to the Twitter user’s experience