CANTINA: A Content-Based Approach to Detecting Phishing Web Sites

Authors: Yue Zhang, Jason Hong, Lorrie Cranor
Presented By: Kim Giglia CSC /7/2008

Introduction
Automated tool to detect phishing web sites: CANTINA
June 2006: 9,255 unique phishing sites reported
Estimated cost of phishing web sites: $1 - $2.8 billion per year
Previous studies found only one phishing detection tool with > 60% accuracy

Tools to detect/prevent phishing
Education
Tools/marks that show trustworthiness
One-way password hashes
Proxies that are browser extensions (PassPet and WebWallet)
ISP-provided toolbars and services

Two major methods to detect phishing:
Heuristics – often produce false positives
Blacklists – labor intensive
One-time URLs reduce the effectiveness of blacklists

How CANTINA works (without added heuristics)
Calculate the TF-IDF scores of the terms on a page
Generate a lexical signature (five terms)
Search for the lexical signature (Google)
Compare the page's domain name to the top N results (30 appears to be maximal)

What is TF-IDF?
TF (term frequency) – number of times a term appears in a given document
IDF (inverse document frequency) – measure of a term's importance; the rarer the term is in the corpus, the higher its IDF
A high TF-IDF weight occurs when TF is high and the term is rare in the corpus (low document frequency, so IDF is high)
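As a rough illustration of the definition above, the following is a minimal sketch of a TF-IDF computation. It uses the raw term count as TF, as on the slide; the logarithmic, add-one smoothed IDF and the names `tf_idf`, `corpus_doc_freq`, and `corpus_size` are assumptions made for this example, not details taken from the presentation.

```python
import math
from collections import Counter


def tf_idf(document_terms, corpus_doc_freq, corpus_size):
    """Return {term: TF * IDF} for one document.

    document_terms:  list of tokens taken from the page.
    corpus_doc_freq: hypothetical mapping of term -> number of corpus
                     documents that contain the term.
    corpus_size:     total number of documents in the corpus.
    """
    counts = Counter(document_terms)
    scores = {}
    for term, tf in counts.items():
        # Assumed IDF variant: log of smoothed inverse document frequency.
        # The +1 terms avoid division by zero for terms the corpus lacks.
        doc_freq = corpus_doc_freq.get(term, 0)
        idf = math.log((1 + corpus_size) / (1 + doc_freq))
        scores[term] = tf * idf
    return scores


# Toy data: "the" occurs in every corpus document, "ebay" in very few.
page_terms = ["ebay", "ebay", "the", "sign", "the", "the"]
doc_freq = {"the": 1000, "sign": 200, "ebay": 5}
scores = tf_idf(page_terms, doc_freq, corpus_size=1000)
```

In this toy run "the" has the highest TF but scores zero because it appears in every corpus document, while the rarer "ebay" receives the largest weight.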

Develop the Lexical Signature
Take the 5 highest-weighted TF-IDF terms
Develop the Robust Hyperlink
Ex: Add the current domain name to the lexical signature
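The two steps above can be sketched as follows: pick the five highest-weighted terms, then optionally append the page's domain name to form the search query. The function names, the alphabetical tie-breaking rule, and the space-joined query format are assumptions for illustration only.

```python
def lexical_signature(scores, size=5):
    """Return the `size` terms with the highest TF-IDF weight.

    scores: mapping of term -> TF-IDF weight.
    Ties are broken alphabetically (an assumption) so the result
    is deterministic.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:size]]


def build_query(signature, domain=None):
    """Join the signature into a query string, optionally adding the domain."""
    terms = list(signature)
    if domain:
        terms.append(domain)
    return " ".join(terms)


# Made-up weights for a hypothetical sign-in page.
weights = {
    "ebay": 10.2, "account": 4.1, "password": 3.0, "user": 2.5,
    "bid": 2.0, "sign": 1.6, "the": 0.0,
}
signature = lexical_signature(weights)
query = build_query(signature, domain="ebay.com")
```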

Search for lexical signature
ZMP (Zero Results Means Phishing) – if Google returns no results, the page is treated as a phishing site
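The decision step can be sketched as below. No real search API is called: `result_urls` is a hypothetical stand-in for the list of URLs a search engine returned for the lexical signature. Comparing hostnames exactly, after dropping a leading "www.", is a simplification assumed here; the presentation does not say how domain names are matched.

```python
from urllib.parse import urlparse


def hostname(url):
    """Return the lowercase hostname of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify(page_url, result_urls, top_n=30, zero_means_phishing=True):
    """Label a page 'legitimate' if its host is among the top N results.

    result_urls: hypothetical list of search-result URLs, best match first.
    zero_means_phishing: apply the ZMP rule when there are no results.
    """
    if not result_urls:
        return "phishing" if zero_means_phishing else "legitimate"
    page_host = hostname(page_url)
    for url in result_urls[:top_n]:
        if hostname(url) == page_host:
            return "legitimate"
    return "phishing"


# Made-up search results for illustration.
results = ["https://ebay.com/", "https://en.wikipedia.org/wiki/EBay"]
genuine = classify("http://www.ebay.com/signin", results)
spoofed = classify("http://ebay-login.example.net/", results)
no_hits = classify("http://ebay-login.example.net/", [])
```

A production matcher would likely compare registered domains rather than full hostnames, which requires a public suffix list.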

Additional Heuristics
Age of Domain
Known Images
Suspicious URLs/Links
IP Address
Dots in URL
Forms
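Two of the heuristics listed above, IP Address and Dots in URL, are simple enough to sketch with the Python standard library. The slide only names the heuristics, so the exact rules here are assumptions: the IP check asks whether the URL's host is a literal IP address, and the `max_dots` threshold of 5 is an illustrative value, not one stated in the presentation.

```python
import ipaddress
from urllib.parse import urlparse


def host_is_ip_address(url):
    """Return True if the URL's host is a literal IPv4 or IPv6 address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Not parseable as an IP address, so it is a name (or empty).
        return False


def too_many_dots(url, max_dots=5):
    """Return True if the URL contains more than `max_dots` dots.

    max_dots is a hypothetical threshold chosen for this example.
    """
    return url.count(".") > max_dots


# Made-up URLs for illustration.
ip_url = "http://192.168.0.1/login"
dotted_url = "http://www.paypal.com.secure.login.example.net/a.php"
plain_url = "http://www.ebay.com/"
```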

CANTINA Implementation
Written in C# using .NET 2003
800 lines of code and 4 libraries
Microsoft IE extension
Document corpus: British National Corpus – 67,962,112 total words and 9,022 unique words
Analyzes the text content of the DOM
Simple user interface: red traffic light

Experimental results
Experiment #1:

Limitations
Doesn’t deal with all JavaScript modifications to pages
DOM parser sometimes returns the wrong text
Some legitimate sites are composed mostly of images
Logos used by the known-images heuristic must be maintained
No dictionary for other languages
Time lag when querying Google
Doesn’t deal with invisible text

Conclusions/Thoughts
Neat idea, but needs work on performance issues
Also needs work on the heuristics to reduce false positives without reducing effectiveness
As long as the search engine returns mostly legitimate sites, CANTINA works, but if not…
Needs work on web pages that change dynamically using JavaScript
