CANTINA: A Content-Based Approach to Detecting Phishing Web Sites
  Authors: Yue Zhang, Jason Hong, Lorrie Cranor
  Presented By: Kim Giglia
  CSC 682
  10/7/2008
Introduction
  Automated tool to detect phishing web sites: CANTINA
  June 2006: 9,255 unique phishing sites reported
  Estimated cost of phishing web sites: $1 - $2.8 billion per year
  Previous studies found only one phishing detection tool with > 60% accuracy
Tools to detect/prevent phishing
  Education
  Tools/marks that show trustworthiness
  One-way password hashes
  Proxies that are browser extensions (PassPet and WebWallet)
  ISP-provided toolbars and services
Two major methods to detect phishing:
  Heuristics – often produce false positives
  Blacklists – labor intensive
  One-time URLs reduce the effectiveness of blacklists
How CANTINA works (without added heuristics)
  Calculate the TF-IDF scores of terms on a page
  Generate lexical signature (five terms)
  Search for lexical signature (Google)
  Compare domain name of page to top N results (30 appears to be maximal)
What is TF-IDF?
  TF (term frequency) – number of times a term appears in a given document
  IDF (inverse document frequency) – measure of the importance of a term; it is higher when the term is rarer across the corpus
  A high TF-IDF weight occurs when TF is high and the term appears in few corpus documents (so IDF is high)
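A minimal sketch of the TF-IDF weighting described above, in Python. The function name `tf_idf`, the toy document-frequency table, and the add-one smoothing inside the logarithm are assumptions made for illustration; the slides do not say which IDF variant CANTINA uses.

```python
import math
from collections import Counter

def tf_idf(page_terms, doc_freq, num_docs):
    """Return {term: tf * idf} for the terms of one page.

    page_terms: list of tokens taken from the page
    doc_freq:   {term: number of corpus documents containing the term}
    num_docs:   total number of documents in the corpus
    """
    tf = Counter(page_terms)
    scores = {}
    for term, count in tf.items():
        # Add-one smoothing (an assumption) keeps unseen terms from dividing by zero.
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(term, 0)))
        scores[term] = count * idf
    return scores

# Toy corpus statistics (hypothetical numbers, not from the British National Corpus).
corpus_df = {"the": 1000, "account": 40, "ebay": 2}
scores = tf_idf(["the", "the", "the", "ebay", "ebay", "account"], corpus_df, 1000)
```

In this toy example "the" is the most frequent term on the page but appears in every corpus document, so its weight is zero, while the rarer "ebay" receives the highest weight.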
Develop the Lexical Signature
  Take the 5 highest-weighted TF-IDF terms
  Develop the Robust Hyperlink
  Ex: http://dom.com/page.html?ls=t1+t2+t3+t4+t5
  Add the current domain name to the lexical signature
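The signature steps above can be sketched as follows. The helper names (`lexical_signature`, `robust_hyperlink`, `search_query`) are hypothetical, ties are broken alphabetically as an assumption, and the exact form of the domain keyword added to the query is not specified in the slides.

```python
from urllib.parse import urlencode

def lexical_signature(scores, k=5):
    """Pick the k terms with the highest TF-IDF weight."""
    # Sort by descending weight; break ties alphabetically (an assumption).
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def robust_hyperlink(url, signature):
    """Append the signature as an 'ls' query parameter, as in the slide's example."""
    # urlencode encodes spaces as '+', giving ls=t1+t2+t3+t4+t5.
    return url + "?" + urlencode({"ls": " ".join(signature)})

def search_query(signature, domain):
    """Build the search-engine query: the signature terms plus the page's domain."""
    return " ".join(signature + [domain])

link = robust_hyperlink("http://dom.com/page.html", ["t1", "t2", "t3", "t4", "t5"])
```

Here `link` reproduces the example URL from the slide, and `search_query` shows one plausible way to add the domain name to the terms sent to the search engine.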
Search for lexical signature
  ZMP (Zero Results Means Phishing) – if Google returns no results, the page is treated as a phishing site
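A hedged sketch of the decision made after the search step. The search results are passed in as a plain list of URLs, so no real search API is called; the function name `classify` and the exact-hostname comparison are simplifying assumptions (a real comparison would likely need to treat www.example.com and login.example.com as the same domain).

```python
from urllib.parse import urlparse

def classify(page_url, result_urls, top_n=30):
    """Label a page using the search results for its lexical signature."""
    if not result_urls:
        # ZMP: zero results means phishing.
        return "phishing"
    page_host = urlparse(page_url).hostname
    # Only the top N results are considered (N = 30 per the slides).
    top_hosts = {urlparse(u).hostname for u in result_urls[:top_n]}
    # Exact hostname match is an assumption made to keep the sketch short.
    return "legitimate" if page_host in top_hosts else "phishing"

results = ["https://www.paypal.com/", "https://en.wikipedia.org/wiki/PayPal"]
verdict = classify("https://www.paypal.com/signin", results)
```

A page whose hostname does not appear among the top results, or whose signature returns nothing at all, is flagged; this is also why the approach depends on the search engine returning mostly legitimate sites.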
Additional Heuristics
  Age of Domain
  Known Images
  Suspicious URLs/Links
  IP Address
  Dots in URL
  Forms
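Three of the URL-based heuristics listed above can be illustrated with a short sketch. The slides give only the heuristic names, so the specific rules here are assumptions: the dot threshold of 5, and treating '@' in the URL or '-' in the host name as suspicious. The function names are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse

def has_ip_host(url):
    """IP Address heuristic: is the host a raw IP address instead of a name?"""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def too_many_dots(url, max_dots=5):
    """Dots in URL heuristic: flag URLs with more than max_dots dots (assumed threshold)."""
    return url.count(".") > max_dots

def suspicious_url(url):
    """Suspicious URL heuristic: '@' anywhere in the URL or '-' in the host (assumed rule)."""
    host = urlparse(url).hostname or ""
    return "@" in url or "-" in host

flags = [has_ip_host("http://192.168.0.1/login"),
         too_many_dots("http://a.b.c.d.e.f.example.com/"),
         suspicious_url("http://secure-login.example.com/")]
```

Each check returns a boolean; how CANTINA combines the heuristic outputs with the TF-IDF result is not described on this slide, so the sketch leaves them separate.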
CANTINA Implementation
  Written in C# using .NET 2003
  800 lines of code and 4 libraries
  Microsoft IE extension
  Document corpus: British National Corpus – 67,962,112 total words and 9,022 unique words
  Analyzes the text content of the DOM
  Simple user interface: red traffic light
Experimental results Experiment #1:
Experimental results Experiment #2:
Experimental results Experiment #3:
Experimental results Experiment #4:
Limitations
  Doesn’t deal with all JavaScript modifications to pages
  DOM parser sometimes returns wrong text
  Some legit sites are composed mostly of images
  Logos in logo heuristic must be maintained
  No dictionary for other languages
  Time lags in querying Google
  Doesn’t deal with invisible text
Conclusions/Thoughts
  Neat idea, but needs work on performance issues
  Also needs work on heuristics to reduce false positives without reducing effectiveness
  As long as the search engine returns mostly legit sites, CANTINA works, but if not…
  Needs work on web pages that change dynamically using JavaScript