Published by Samuel Paul. Modified over 6 years ago.
1
CANTINA: A Content-Based Approach to Detecting Phishing Web Sites
Authors: Yue Zhang, Jason Hong, Lorrie Cranor
Presented By: Kim Giglia CSC /7/2008
2
Introduction
- Automated tool to detect phishing web sites: CANTINA
- June 2006: 9,255 unique phishing sites reported
- Estimated costs of phishing web sites: $1 - $2.8 billion per year
- Previous studies found only one phishing detection tool with > 60% accuracy
3
Tools to detect/prevent phishing
- Education
- Tools/Marks that show trustworthiness
- One-way password hashes
- Proxies that are browser extensions (PassPet and WebWallet)
- ISP-provided toolbars, services
4
Two major methods to detect phishing:
- Heuristics – Often produce false positives
- Blacklists – Labor intensive
- One-time URLs reduce effectiveness of blacklists
5
How CANTINA works (without added heuristics)
- Calculate the TF-IDF scores of terms on a page
- Generate a lexical signature (five terms)
- Search for the lexical signature (Google)
- Compare the domain name of the page to the top N results (going beyond N = 30 appears to give no further benefit)
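The last step above, comparing the page's domain against the top N search results, can be sketched as follows. This is a minimal illustration, not the CANTINA source: the function names are hypothetical, the result URLs are assumed to come from whatever search client is in use, and matching on the full hostname with a leading "www." stripped is a simplifying assumption.

```python
from urllib.parse import urlparse


def normalize_host(url):
    """Lower-cased hostname of `url` without a leading 'www.' (assumed normalization)."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def domain_in_top_results(page_url, result_urls, top_n=30):
    """True if the page's host matches the host of any of the first top_n results."""
    page_host = normalize_host(page_url)
    if not page_host:
        # No usable hostname (e.g. URL without a scheme): never report a match.
        return False
    return any(normalize_host(u) == page_host for u in result_urls[:top_n])
```

A page whose host never shows up among the results for its own lexical signature would be flagged as suspicious under this scheme.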
6
What is TF-IDF?
- TF (term frequency) – number of times a term appears in a given document
- IDF (inverse document frequency) – measure of importance of a term; the rarer the term is in the corpus, the higher its IDF
- A high TF-IDF weight occurs when TF is high and IDF is high, i.e. the term is frequent in the document but rare in the corpus
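As a rough illustration of the weighting, the sketch below multiplies the raw term count by the log of the inverse document frequency. The function name, the toy document-frequency table, and the add-one smoothing (used so that terms missing from the corpus do not divide by zero) are assumptions made for this example, not details taken from the slides.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, doc_freq, n_docs):
    """Map each term in the document to tf * log((1 + N) / (1 + df))."""
    tf = Counter(doc_tokens)  # raw term counts in this document
    return {
        term: count * math.log((1 + n_docs) / (1 + doc_freq.get(term, 0)))
        for term, count in tf.items()
    }


# Toy corpus statistics (hypothetical): number of documents containing each term.
doc_freq = {"the": 1000, "in": 900, "sign": 50, "ebay": 2}
tokens = ["ebay", "sign", "in", "the", "the", "ebay", "ebay"]
weights = tf_idf(tokens, doc_freq, n_docs=1000)
```

Here "the" appears twice but occurs in every corpus document, so its weight is zero, while "ebay" is both frequent on the page and rare in the corpus and receives the largest weight.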
7
Develop the Lexical Signature
- Take the 5 highest weighted TF-IDF terms
Develop the Robust Hyperlink
- Ex: Add the current domain name to the lexical signature
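One way to picture the two steps on this slide is shown below: take the five highest weighted terms, then append the page's domain name to form the search query. The helper names are hypothetical, and exactly which form of the domain gets appended (here the hostname minus a leading "www.") is an assumption for the sake of the sketch.

```python
from urllib.parse import urlparse


def lexical_signature(weights, k=5):
    """The k terms with the highest TF-IDF weight (ties broken alphabetically)."""
    return sorted(weights, key=lambda term: (-weights[term], term))[:k]


def build_query(weights, page_url, k=5):
    """Lexical signature plus the page's domain name, joined into one query string."""
    host = (urlparse(page_url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # assumed normalization, not specified in the slides
    terms = lexical_signature(weights, k)
    return " ".join(terms + [host]) if host else " ".join(terms)


# Hypothetical TF-IDF weights for a sign-in page.
weights = {"ebay": 17.4, "sign": 3.0, "password": 2.5,
           "user": 2.0, "account": 1.8, "the": 0.0}
query = build_query(weights, "https://signin.ebay.com/ws/login")
```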
8
Search for lexical signature
ZMP (Zero Results Means Phishing) – if Google returns no results for the lexical signature, the page is classified as a phishing site
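The ZMP rule amounts to one extra branch in the decision logic, as the sketch below suggests. The function name and the string labels are made up for illustration, and the "undecided" outcome when ZMP is switched off is an assumption: the slides do not say how an empty result set is handled without ZMP.

```python
def verdict(page_domain, result_domains, zmp=True):
    """Label a page from the domains returned for its lexical signature."""
    if not result_domains:
        # Zero results: ZMP treats this as phishing; otherwise no call is made (assumption).
        return "phishing" if zmp else "undecided"
    return "legitimate" if page_domain in result_domains else "phishing"
```

With `zmp=True`, a page whose signature matches nothing in the search index is flagged rather than waved through.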
9
Additional Heuristics
- Age of Domain
- Known Images
- Suspicious URLs/Links
- IP Address
- Dots in URL
- Forms
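Three of the heuristics listed above can be approximated from the URL alone. The sketch below is illustrative only: the function names are hypothetical, and the specific rules (an '@' in the authority or a '-' in the host counting as suspicious, five or more dots counting as too many) are assumptions, since the slides name the heuristics without giving thresholds.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip(url):
    """IP Address heuristic: the host is a literal IP address rather than a name."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def too_many_dots(url, threshold=5):
    """Dots in URL heuristic: the URL has at least `threshold` dots (assumed cutoff)."""
    return url.count(".") >= threshold


def suspicious_url(url):
    """Suspicious URL heuristic: '@' in the authority part or '-' in the host (assumed rule)."""
    parsed = urlparse(url)
    return "@" in parsed.netloc or "-" in (parsed.hostname or "")
```

The remaining heuristics (domain age, known images, forms) need a WHOIS lookup or the page content, so they are left out of this URL-only sketch.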
10
CANTINA Implementation
- Written in C# using .NET 2003
- 800 lines of code and 4 libraries
- Microsoft IE extension
- Document corpus: British National Corpus – 67,962,112 total words and 9,022 unique words
- Analyzes the text content of the DOM
- Simple user interface: red traffic light
11
Experimental results Experiment #1:
12
Experimental results Experiment #2:
13
Experimental results Experiment #3:
14
Experimental results Experiment #4:
15
Limitations
- Doesn’t deal with all JavaScript modifications to pages
- DOM parser sometimes returns wrong text
- Some legit sites composed mostly of images
- Logos in logo heuristic must be maintained
- No dictionary for other languages
- Time lags in querying from Google
- Doesn’t deal with invisible text
16
Conclusions/Thoughts
- Neat idea, but needs work on performance issues
- Also needs work on heuristics to reduce false positives without reducing effectiveness
- As long as the search engine returns mostly legit sites, CANTINA works, but if not…
- Needs work on web pages that dynamically change using JavaScript