Design and Evaluation of a Real-Time URL Spam Filtering Service

Presentation transcript:

Design and Evaluation of a Real-Time URL Spam Filtering Service Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song IEEE Symposium on Security and Privacy 2011

OUTLINE
Introduction - Monarch
Related Work
System Design
Implementation
Evaluation
Discussion and Conclusion

Spam URL
Advertisement
Harmful content: phishing, malware, and scams
Use of compromised and fraudulent accounts
Email, web services

Monarch Spam URL Filtering as a Service Tens of millions of features

Related Work
“Detecting spammers on Twitter” (2010): post frequency, URLs, friends…
“Behind phishing: an examination of phisher modi operandi” (2008): lexical characteristics of phishing URLs
“Cantina: a content-based approach to detecting phishing web sites” (2007): parses HTML content

System Design
Monarch’s cloud infrastructure
URL Aggregation: email providers and Twitter’s streaming API
Feature Collection: visits a URL with web browsers to collect page content

System Design (cont.)
Monarch’s cloud infrastructure
Feature Extraction: transforms the raw data into a sparse feature vector
Classification: training and testing by distributed logistic regression

Collect Raw Features – Web Browser
“A taxonomy of JavaScript redirection spam” (2007)
A lightweight browser is not enough: poor HTML parsing, lack of JavaScript and plugins
Instrumented version of Firefox: JavaScript enabled, Flash and Java installed
Visits a URL and monitors a number of details

Raw Features
Web Browser: initial URL and landing URL, redirects, sources and frames; HTML content, page links; JavaScript events, pop-up windows, plugins; HTTP headers
DNS Resolver: initial, final, and redirect URLs
IP Address Analysis: city, country, ASN
Proxy and Whitelist (200 domains)

Feature Vector
Raw features => sparse feature vector
Canonicalize URLs: remove obfuscation
Tokenize the text corpus: split on non-alphanumeric characters
Example: http://adl.tw/~dada/dada2.php?a=1&b=3
  domain feature [adl, tw]
  path feature [dada, dada2, php]
  query parameters feature [a, 1, b, 3]
  tokens present: (1, 3, a, adl, b, dada, dada2, php, tw)
  => (…, adl:true, adm:false, …, dada:true, …, tw:true, …)
49,960,691 features (dimensions) in total
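
The tokenization step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Monarch's extraction code: the function names `url_tokens` and `sparse_vector` are hypothetical, and storing only the tokens that are present in a dict is an assumed stand-in for the sparse binary vector.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL's domain, path, and query on non-alphanumeric characters."""
    parts = urlsplit(url)

    def split(text):
        # Drop the empty strings produced by leading or repeated separators.
        return [tok for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]

    return {
        "domain": split(parts.netloc),
        "path": split(parts.path),
        "query": split(parts.query),
    }


def sparse_vector(url):
    """Binary sparse vector: only tokens that occur are stored (as True)."""
    groups = url_tokens(url)
    return {tok: True for toks in groups.values() for tok in toks}


tokens = url_tokens("http://adl.tw/~dada/dada2.php?a=1&b=3")
vector = sparse_vector("http://adl.tw/~dada/dada2.php?a=1&b=3")
print(tokens)
print(sorted(vector))
```

Absent features such as `adm` are simply not stored, which is what keeps a vector with tens of millions of dimensions small in practice.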

Distributed Classifier Design
Linear classification: x is the feature vector
Determine a weight vector w
A parallel online learner with regularization to yield a sparse weight vector
Labeled data (x_i, y_i)
Testing: -1 => non-spam site, 1 => spam site
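
A rough sketch of how a linear classifier turns a sparse binary feature vector into one of the two labels on this slide. The decision rule (sign of the dot product, threshold at zero) is an assumption, since the slide's formula did not survive; the function name `predict` and the example weights are hypothetical.

```python
def predict(weights, features):
    """Return 1 (spam) or -1 (non-spam) from the sign of the sparse dot product.

    weights:  dict mapping feature name -> learned weight (missing means 0)
    features: iterable of feature names that are present (binary features)
    """
    score = sum(weights.get(name, 0.0) for name in features)
    # Assumed threshold: a score of exactly 0 is treated as non-spam.
    return 1 if score > 0 else -1


example_weights = {"adl": 1.5, "php": -0.2}  # hypothetical values
print(predict(example_weights, {"adl", "php", "tw"}))
print(predict(example_weights, {"php", "tw"}))
```

Because the weight vector is sparse, only features with a non-zero weight contribute, so scoring costs time proportional to the number of features present in the URL rather than to the full dimensionality.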

Training the weight vector
Logistic regression with subgradient L1-regularization
y_i (x_i · w) larger => f(w) smaller (classification margin, hyperplane)
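
The slide's objective did not survive extraction. The sketch below assumes the standard L1-regularized logistic loss, f(w) = sum_i log(1 + exp(-y_i (x_i · w))) + lam * ||w||_1, and takes plain batch subgradient steps on it. It is a single-machine illustration, not Monarch's distributed learner; `lam`, `rate`, and the toy data are hypothetical.

```python
import math


def dot(w, x):
    """Sparse dot product between weight dict w and feature dict x."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def objective(w, data, lam):
    """Assumed form: logistic loss plus lam times the L1 norm of w."""
    loss = sum(math.log1p(math.exp(-y * dot(w, x))) for x, y in data)
    return loss + lam * sum(abs(v) for v in w.values())


def sign(v):
    return (v > 0) - (v < 0)


def subgradient_step(w, data, lam, rate):
    """One batch subgradient step; at w_k = 0 the L1 subgradient is taken as 0."""
    grad = {}
    for x, y in data:
        # d/dw log(1 + exp(-y * x.w)) = -y * x / (1 + exp(y * x.w))
        coeff = -y / (1.0 + math.exp(y * dot(w, x)))
        for k, v in x.items():
            grad[k] = grad.get(k, 0.0) + coeff * v
    for k, v in w.items():
        grad[k] = grad.get(k, 0.0) + lam * sign(v)
    return {k: w.get(k, 0.0) - rate * g for k, g in grad.items()}


toy_data = [({"a": 1.0}, 1), ({"b": 1.0}, -1)]  # (features, label) pairs
w = {}
start = objective(w, toy_data, lam=0.01)
for _ in range(50):
    w = subgradient_step(w, toy_data, lam=0.01, rate=0.1)
end = objective(w, toy_data, lam=0.01)
print(start, end, w)
```

As the margin y_i (x_i · w) grows for each example, the loss term shrinks, which is the relationship the slide points at.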

Distributed Classifier Algorithm

Data Set and Assumptions
1.25 million spam email URLs
567,784 spam Twitter URLs
9 million non-spam Twitter URLs
All Twitter URLs are checked against Google Safe Browsing, SURBL, URIBL, APWG, and PhishTank
A URL is treated as spam if any of its source URLs becomes blacklisted

Data Set and Assumptions (cont.)
On Twitter: 36% scams, 60% phishing, 4% malware

After regularization

Implementation
Amazon Web Services (AWS) infrastructure
URL Aggregation: a queue that keeps 300,000 URLs
Feature Collection: 20x6 Firefox (4.0b4) instances on Ubuntu 10.04 with a custom extension; Firefox’s NPAPI, Linux’s “host” command, MaxMind GeoIP library, and Route Views
Classifier: Hadoop Distributed File System on the 50-node cluster

Evaluation – Overall Accuracy
5-fold cross-validation
500,000 spam and 500,000 non-spam examples
Training set size: 400,000 examples at ratios of 1:1, 4:1, and 10:1
Testing set size: 200,000 examples at a ratio of 1:1
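
The evaluation setup can be illustrated with two small helpers. This is a sketch under assumptions: the slide does not say which class is the larger one in the 4:1 and 10:1 ratios, so the code treats them as non-spam to spam, and the helper names `class_counts` and `k_fold_indices` are hypothetical.

```python
def class_counts(total, ratio):
    """Split `total` examples into (majority, minority) counts for a ratio:1 mix."""
    minority = round(total / (ratio + 1))
    return total - minority, minority


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists so every example is tested exactly once."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


for ratio in (1, 4, 10):
    print(ratio, class_counts(400_000, ratio))

folds = list(k_fold_indices(10, k=5))
print(folds[0])
```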

Evaluation – Single Feature

Evaluation – Accuracy Over Time
Training once only vs. retraining every four days

Evaluation – Comparing Email and Tweet Spam Log odds ratio:
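
The formula on this slide was lost in extraction. For reference, the sketch below uses the textbook log odds ratio for a feature that appears with probability `p_email` in email spam and `p_tweet` in tweet spam; whether the slide used exactly this form (for example, with an absolute value) is not recoverable from the transcript, and the parameter names are hypothetical.

```python
import math


def log_odds_ratio(p_email, p_tweet):
    """log of (odds of the feature in email spam) / (odds in tweet spam).

    Both probabilities must lie strictly between 0 and 1.
    """
    odds_email = p_email / (1.0 - p_email)
    odds_tweet = p_tweet / (1.0 - p_tweet)
    return math.log(odds_email / odds_tweet)


print(log_odds_ratio(0.5, 0.5))
print(log_odds_ratio(0.8, 0.2))
```

A value near zero means the feature is about equally common in both kinds of spam; a large positive or negative value means it is characteristic of one source.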

Evaluation – The Cost For Twitter, $22,751 per month

Discussion and Conclusion
Evasion: feature evasion, time-based evasion, crawler evasion
Monarch: real-time system; spam URL filtering as a service; $22,751 a month