Phishing Website Detection & Target Identification. October 30th, 2015. Samuel Marchal*, Kalle Saari*, Nidhi Singh†, N. Asokan*. *Aalto University, †Intel.

Presentation transcript:

Phishing Website Detection & Target Identification
October 30th, 2015
Samuel Marchal*, Kalle Saari*, Nidhi Singh†, N. Asokan*
*Aalto University, †Intel Security

2 Outline
Phishing detection system
–minimal training data, language-independence, scalability
–high accuracy, fast, locally computable (comparable to state-of-the-art)
Target identification mechanism
–language-independence, fast
–high accuracy (comparable to state-of-the-art)


4 Phishing Website

5 Data Sources
Starting URL
Landing URL
Redirection chain
Logged links
HTML source code:
–Text
–Title
–HREF links
–Copyright
Screenshot
…

6 Phisher’s Control & Constraints
Phishers have different levels of control and are subject to some constraints while building a webpage:
Control: Externally loaded content (logged links) and external HREF links are not controlled by the page owner.
Constraints: The registered domain name part of a URL cannot be freely defined; it is constrained by registration (DNS) policies.

7 Hypothesis
By modeling control/constraints in a feature set, we can improve identification of phishing webpages
–The resulting model should generalize well and be language independent
By analyzing terms used in controlled and constrained sources, we can identify the target of a phish

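8 URL Structure
protocol://[subdomains.]mld.ps[/path][?query]
–Protocol = https
–FQDN =
–RDN = amazon.co.uk
–mld = amazon
–FreeURL = {www, /ap/signin?_encoding=UTF8}
Diagram labels: the FQDN covers [subdomains.]mld.ps, the RDN covers mld.ps, and the FreeURL covers the subdomains together with the path and query.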
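The decomposition on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `split_url`, the tiny `PUBLIC_SUFFIXES` set (a stand-in for the full Public Suffix List), and the example URL (reassembled from the components listed on the slide) are all assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for the Public Suffix List (assumption);
# a real system would need the complete list.
PUBLIC_SUFFIXES = {"com", "org", "net", "uk", "co.uk", "fr", "de"}

def split_url(url):
    """Split a URL into protocol, FQDN, RDN, mld and FreeURL parts."""
    parts = urlsplit(url)
    fqdn = (parts.hostname or "").lower()
    labels = fqdn.split(".") if fqdn else []

    # Fallback: treat the last label as the public suffix.
    suffix_start = max(len(labels) - 1, 0)
    # Scanning from the left finds the longest matching suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            suffix_start = i
            break

    mld_index = max(suffix_start - 1, 0)
    mld = labels[suffix_start - 1] if suffix_start >= 1 else ""
    rdn = ".".join(labels[mld_index:])          # mld.ps
    subdomains = ".".join(labels[:mld_index])   # everything left of the mld
    path_query = parts.path + ("?" + parts.query if parts.query else "")

    return {
        "protocol": parts.scheme,
        "fqdn": fqdn,
        "rdn": rdn,
        "mld": mld,
        # FreeURL: the parts the page owner can choose freely.
        "free_url": [subdomains, path_query],
    }

result = split_url("https://www.amazon.co.uk/ap/signin?_encoding=UTF8")
print(result["rdn"], result["mld"], result["free_url"])
```

The sketch keeps the RDN and the FreeURL apart because the slides treat them differently: the RDN is constrained by registration policies, while the FreeURL is not.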

9 Data Sources Control & Constraints
Control / Constraint separation:
–RDNs are constrained in composition
–FreeURL, text, title, etc. are not constrained
–RDNs in redirection chain are controlled (internal) by the page owner
–Other RDNs (HREFs and logged links) are not controlled (external)
Data sources separation:
               Unconstrained                               Constrained
Controlled     Text, Title, Copyright, Internal FreeURL    Internal RDNs
Uncontrolled   External FreeURL                            External RDNs
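One way to make this separation concrete is a small lookup table plus a helper that splits RDNs into internal and external sets. The names `SOURCE_GRID` and `partition_rdns`, and the rule that a linked RDN already seen in the redirection chain counts as internal, are assumptions made for illustration; the domain names are hypothetical.

```python
# Grid from the slide, encoded as source -> (controlled, constrained).
SOURCE_GRID = {
    "text": (True, False),
    "title": (True, False),
    "copyright": (True, False),
    "internal_free_url": (True, False),
    "internal_rdns": (True, True),
    "external_free_url": (False, False),
    "external_rdns": (False, True),
}

def partition_rdns(redirection_chain_rdns, linked_rdns):
    """Split RDNs into internal (controlled) and external (uncontrolled).

    RDNs seen in the redirection chain are internal; any other RDN found
    in HREF links or logged links is external.
    """
    internal = set(redirection_chain_rdns)
    external = set(linked_rdns) - internal
    return internal, external

# Hypothetical domains, for illustration only.
internal, external = partition_rdns(
    ["short.test", "landing.test"],
    ["landing.test", "cdn.test", "brand.test"],
)
print(sorted(internal), sorted(external))
```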

10 Phishing Classification System
Feature extraction (212) from data sources:
–URL features (106)
–Term usage consistency (66)
–Usage of starting and landing mld (22)
–RDN usage (13)
–Webpage content (5)
Gradient Boosting classification:
–Feature selection and weighting
–Robustness to over-fitting (generalizability)

11 Classification Performance (language independence)
Classifier Training:
–4,531 English legitimate webpages
–1,036 phishing webpages
Assessment:
–100,000 English legitimate webpages
–10,000 French legitimate webpages
–10,000 German legitimate webpages
–10,000 Italian legitimate webpages
–10,000 Portuguese legitimate webpages
–10,000 Spanish legitimate webpages
–1,216 phishing webpages

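12 Classification Performance (language independence)
Figures: ROC curve; Precision vs. Recall
Test set: 100,000 English legitimate webpages / 1,216 phishing webpages
Table columns: Precision, Recall, FP Rate, AUC, Accuracy (values not preserved in the transcript)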

13 Scalability

14 Outline
Phishing detection system
–minimal training data, language-independence, scalability
–high accuracy, fast, locally computable (comparable to state-of-the-art)
Target identification mechanism
–language-independence, fast
–high accuracy (comparable to state-of-the-art)

15 Target Identification
Target identification: identify a set of terms representing the impersonated service and brand (keyterms)
Assumption: keyterms appear in several data sources
Query a search engine with the top keyterms to identify:
–whether the website is legitimate (appearing in top search results)
–the potential targets of the phishing website
Intersect sets of terms extracted from different visible data sources (title, text, starting/landing URL, copyright, HREF links)
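The term-intersection step can be sketched as follows. This is a simplified illustration, not the authors' algorithm: the tokenizer, the `min_sources` threshold, the ranking by number of sources, and the example page are all assumptions, and the search-engine query that would follow is left out.

```python
import re
from collections import Counter

def terms(text):
    """Lowercase alphabetic terms of at least three letters (assumed tokenizer)."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))

def keyterms(sources, min_sources=2, top_n=5):
    """Return terms shared by at least `min_sources` data sources.

    Terms are ranked by the number of sources they appear in,
    then alphabetically to keep the order deterministic.
    """
    counts = Counter()
    for text in sources.values():
        counts.update(terms(text))  # each source counts a term once
    shared = [t for t, c in counts.items() if c >= min_sources]
    return sorted(shared, key=lambda t: (-counts[t], t))[:top_n]

# Hypothetical phishing page, for illustration only.
page = {
    "title": "ExampleBank login",
    "text": "Welcome to ExampleBank. Enter your password to continue.",
    "landing_url": "http://examplebank.secure-login.test/login",
    "copyright": "Copyright 2015 ExampleBank Inc",
}
print(keyterms(page))
```

In this toy page the brand name appears in all four sources and therefore ranks first, which matches the slide's assumption that keyterms recur across data sources.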

16 Target Identification Performance
600 phishing webpages with identified target:
–(unverified phishes listed by PhishTank; identification done manually)
Results table (numeric values not preserved in the transcript): columns Targets, Identified, Unknown, Missed, Success rate; three "Top" rows, each ending in a percentage
Complementarity with phishing detection:
–53 mislabeled legitimate webpages ( FP rate)
–39 identified as legitimate in target identification
Reduction of FP rate to (0.01%)
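The false-positive arithmetic on this slide can be checked with a short calculation. The slide does not say which set of legitimate webpages the rate is computed over, so the sketch tries both candidates from the assessment slide (100,000 English pages, or 150,000 pages across all six languages); that choice of denominators and the helper name `fp_rate_percent` are assumptions.

```python
def fp_rate_percent(false_positives, legitimate_total):
    """False positive rate as a percentage of legitimate webpages."""
    return 100.0 * false_positives / legitimate_total

mislabeled = 53   # legitimate pages flagged as phishing by the classifier
cleared = 39      # of those, recognized as legitimate by target identification
remaining = mislabeled - cleared

# Candidate denominators (assumption): English only, or all six languages.
for legitimate_total in (100_000, 150_000):
    rate = fp_rate_percent(remaining, legitimate_total)
    print(legitimate_total, remaining, round(rate, 2))
```

With either denominator, the 14 remaining false positives round to the 0.01% quoted on the slide.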

17 Concluding Remarks
Phishing website detection system:
–Language independent
–Scalable
–Fast (< 1 second per webpage)
–Client-side implementable
–> 99.9% accuracy with < 0.05% false positives
Target identification system:
–Fast
–Success rate > 90% for 1 target / 97.3% for a set of targets

18 Demo
Pipeline with both systems in a chain
–Classify unverified phishes from PhishTank
–Identify target
