CANTINA: A Content-Based Approach to Detecting Phishing Web Sites
WWW
Yue Zhang, Jason Hong, and Lorrie Cranor

CS710 | KAIST
Agenda
- Phishing Attacks
- Motivation & Goal
- Related Work
- CANTINA
- Evaluation
- Conclusion

Phishing Attacks (1/2)
- The act of stealing personal information via the Internet for the purpose of committing financial fraud
  - Create a fake site that resembles an original site, such as a bank's
  - Deliver it to users by various methods: spam, XSS vulnerabilities, malware, …
- Technical issues
  - URL obfuscation: similar domains, URL encoding, …
  - DNS hijacking: modifying the hosts file, DNS server settings, …
  - Malware: BHO (Browser Helper Object), browser toolbar, key logger, …

Phishing Attacks (2/2)
- Criminals often create phishing sites by copying and then modifying a legitimate site’s web pages
  - Similar to the original web site
  - Often contain brand names (the owner’s brands) and other terms that are common on a given web page

Motivation & Goal
- Phishing is a rapidly growing problem, with 9,255 unique phishing sites reported in 2006
- 84 anti-phishing toolbars, but with low accuracy
- There is a strong need for better automated detection algorithms
- Goal: a novel content-based approach for detecting phishing web sites
  - Achieve higher accuracy than existing approaches

Related Work (1/3)
- Anti-phishing work falls into four categories
  - Why people fall for phishing attacks
    - Studies that examine the reasons people fall for phishing attacks
  - Educating people about phishing attacks
    - Focused on online training materials, testing, and situated learning
  - Anti-phishing user interfaces
    - Focused on developing better user interfaces for anti-phishing tools
  - Automated detection of phishing

Related Work (2/3)
- Anti-phishing user interfaces
  - Toolbar-based approaches
  - Browser extensions
  - Dynamic Security Skins
  - Web Wallet

Related Work (3/3)
- Automated detection of phishing
  - Use heuristics to judge whether a page has phishing characteristics
    - Host name, domain name, URLs, …
  - Use a blacklist of reported phishing URLs

CANTINA | Basic Concept
- Criminals often create phishing sites by copying and then modifying a legitimate site’s web pages
  - The copies contain the brand names and terms of the legitimate pages
- Robust Hyperlinks
  - Designed to recover from broken links
  - Add a lexical signature to URLs
  - If a link doesn’t work, feed the signature to a search engine
  - Example signature method: TF-IDF
- TF-IDF (term frequency / inverse document frequency)
  - Frequency-based algorithm
  - Basic algorithm used by search engines for comparing and classifying documents
  - A term has a high TF-IDF weight when it has a high term frequency in a given document and a low document frequency across the whole collection of documents
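
The TF-IDF weighting described on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `tf_idf`, the toy corpus, and the add-one smoothing in the IDF term are all assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Return {term: TF-IDF weight} for one document against a corpus.

    doc_tokens: list of lowercase terms from the page being scored.
    corpus: list of token lists standing in for the document collection.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for d in corpus if term in d)
        # Add-one smoothing (an assumption of this sketch) avoids division by zero.
        idf = math.log((1 + n_docs) / (1 + df))
        weights[term] = (count / len(doc_tokens)) * idf
    return weights

# Toy data, invented for illustration only.
corpus = [
    ["sign", "in", "to", "your", "account"],
    ["help", "with", "your", "account"],
    ["your", "user", "account", "help"],
]
page = ["ebay", "sign", "in", "ebay", "user", "your", "account"]
weights = tf_idf(page, corpus)
top_term = max(weights, key=weights.get)
```

In this toy corpus the brand term is frequent on the page and absent elsewhere, so it gets the highest weight, while a term that appears in every document (such as "your") gets a weight of zero.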

CANTINA | Basic Concept
1. Start from the web page
2. Calculate the TF-IDF weight of each term
3. Take the five terms with the highest TF-IDF weights
4. Search for the top five terms (term1 + term2 + ...) using Google
5. Compare the page's domain name with the Google search results
- Phishing site: the domain name of the current page does not match the domain name of any of the top N search results (N = 30)
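
The pipeline on this slide might look roughly like the sketch below. It is an illustration under stated assumptions: `search` is a hypothetical callable that returns result URLs (no real search API is called), `domain_of` crudely takes the last two labels of the host name (real code would need the Public Suffix List), and the example weights are made up.

```python
from urllib.parse import urlparse

def domain_of(url):
    """Crude registered-domain guess: the last two labels of the host name.
    (Assumption: production code would consult the Public Suffix List.)"""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

def lexical_signature(weights, k=5):
    """Top-k terms by TF-IDF weight."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

def looks_legitimate(page_url, weights, search, n=30):
    """True if the page's domain appears among the top-n search results."""
    query = " ".join(lexical_signature(weights))
    results = search(query)[:n]
    return domain_of(page_url) in {domain_of(r) for r in results}

def fake_search(query):
    """Hypothetical stand-in for a search engine; returns fixed result URLs."""
    return ["https://signin.ebay.com/", "https://www.ebay.com/help"]

# Made-up TF-IDF weights for a login page.
weights = {"ebay": 0.9, "user": 0.5, "sign": 0.4,
           "help": 0.3, "forgot": 0.2, "the": 0.0}

real = looks_legitimate("https://signin.ebay.com/login", weights, fake_search)
fake = looks_legitimate("http://ebay.example-phish.net/login", weights, fake_search)
```

Both pages yield the same signature, but only the real page's domain shows up in the results, so `real` is true and `fake` is false.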

CANTINA | Basic Concept
- Faked page, TF-IDF top 5: eBay, user, sign, help, forgot

CANTINA | Basic Concept
- Real page, TF-IDF top 5: eBay, user, sign, help, forgot

CANTINA | Additional Solutions
- Basic CANTINA has a number of false positives
- Solutions
  - Add the current domain name to the lexical signature
  - ZMP (Zero results Means Phishing)
    - Google returns zero search results, e.g., for a meaningless domain such as “u-s-j.be”
  - Larger set of heuristics based on related work
    - Taken from existing approaches (e.g., SpoofGuard, PILFER)
    - Age of Domain, Known Images, Suspicious URL, …
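
The first two refinements can be sketched as follows. This is a hedged illustration only: `classify` and `stub_search` are hypothetical names, the stub returns result domains directly, and treating a zero-result page as legitimate when ZMP is switched off is an assumption of this sketch rather than something the slide states.

```python
def classify(signature, page_domain, search, n=30, zmp=True):
    """Label a page using the domain-augmented signature and, optionally, ZMP."""
    # Add the page's own domain name to the lexical signature before searching.
    query = " ".join(list(signature) + [page_domain])
    result_domains = search(query)[:n]
    if not result_domains:
        # ZMP: zero results means phishing. With zmp=False the page gets the
        # benefit of the doubt (an assumption of this sketch).
        return "phishing" if zmp else "legitimate"
    return "legitimate" if page_domain in result_domains else "phishing"

def stub_search(query):
    """Hypothetical search stub: knows ebay.com, finds nothing for u-s-j.be."""
    if "u-s-j.be" in query:
        return []
    return ["ebay.com", "wikipedia.org"]

sig = ["ebay", "user", "sign", "help", "forgot"]
print(classify(sig, "ebay.com", stub_search))   # legitimate
print(classify(sig, "u-s-j.be", stub_search))   # phishing (zero results)
```

Including the domain in the query makes a legitimate page more likely to find itself in the results, while a meaningless phishing domain tends to produce an empty result list that ZMP then flags.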

Evaluation | Effectiveness #1 (1/2)
- Four conditions
  - Basic TF-IDF
  - Basic TF-IDF + domain name
  - Basic TF-IDF + ZMP
  - Basic TF-IDF + domain + ZMP
- 100 phishing URLs and 100 legitimate URLs
  - Phishing URLs: PhishTank.com
  - Legitimate URLs: from a previous study

Evaluation | Effectiveness #1 (2/2)
- Basic TF-IDF + ZMP + domain: used as Final TF-IDF
- False positives are still a little high

Evaluation | Effectiveness #2 (1/2)
- Goal: reduce false positives
- Approach: combine several heuristic methods

Evaluation | Effectiveness #2 (2/2)
- Determining the best weights for these heuristics is a typical classification problem
- Use a simple forward linear model
- Used 100 phishing URLs and 100 legitimate URLs to find the weights
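
One way to read "linear model" here is a weighted vote over heuristics. The sketch below assumes each heuristic votes +1 (legitimate) or -1 (phishing) and that the sign of the weighted sum decides the label; the heuristic names and weight values are hypothetical, not the ones learned in the study.

```python
def combine(votes, weights):
    """Weighted vote: each heuristic contributes +1 (legitimate) or -1 (phishing)."""
    score = sum(weights[name] * vote for name, vote in votes.items())
    # A positive total means legitimate; zero or negative means phishing.
    return "legitimate" if score > 0 else "phishing"

# Hypothetical weights; in practice they would be fit on labeled URLs.
weights = {"tf_idf": 0.5, "age_of_domain": 0.2,
           "known_images": 0.15, "suspicious_url": 0.15}

suspicious = {"tf_idf": -1, "age_of_domain": -1,
              "known_images": 1, "suspicious_url": 1}
clean = {"tf_idf": 1, "age_of_domain": 1,
         "known_images": 1, "suspicious_url": 1}

label_suspicious = combine(suspicious, weights)
label_clean = combine(clean, weights)
```

With these made-up weights, the two heavier heuristics outvote the two lighter ones, so the first page is labeled phishing and the second legitimate.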

Evaluation | Effectiveness #3 (1/2)
- Compare the effectiveness of Final-TF-IDF, Final-TF-IDF+heuristics, SpoofGuard, and Netcraft
  - SpoofGuard: the highest true positive rate; relies entirely on heuristics
  - Netcraft: one of the best toolbars overall; uses a combination of heuristics and an extensive blacklist
- 100 phishing URLs from PhishTank.com
- 100 legitimate URLs
  - 35 sites that are often attacked (e.g., Citibank, PayPal)
  - 35 top pages from Alexa (most popular sites)
  - 30 random web pages from random.yahoo.com

Evaluation | Effectiveness #3 (2/2)
- Reduced false positives from 6% to 1% by combining Final-TF-IDF with simple heuristics
- However, the true positive rate also decreased (from 97% to 89%)
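
For reference, the two rates quoted in the evaluation can be computed as below. The counts are back-computed from the reported percentages and the 100 + 100 URL test set, so they are illustrative assumptions rather than figures taken from the paper's tables; `rates` is a hypothetical helper.

```python
def rates(flagged_phishing, total_phishing, flagged_legitimate, total_legitimate):
    """Return (true positive rate, false positive rate).

    True positive: a phishing page correctly flagged as phishing.
    False positive: a legitimate page wrongly flagged as phishing.
    """
    tpr = flagged_phishing / total_phishing
    fpr = flagged_legitimate / total_legitimate
    return tpr, fpr

# Illustrative counts matching the reported percentages on 100 + 100 URLs.
tfidf_only = rates(97, 100, 6, 100)
with_heuristics = rates(89, 100, 1, 100)
```

The comparison shows the trade-off on the slide: adding heuristics lowers the false positive rate but also lowers the true positive rate.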

Discussion
- Limitations
  - Does not apply to non-English web sites
  - System performance depends on the performance of the Google search engine
- Possible attacks by criminals
  - Use images instead of words
  - Add invisible text
  - Circumvent TF-IDF and PageRank
    - Using “Google Bombs”
  - Attempt a DoS attack on Google

Conclusion
- CANTINA uses TF-IDF + search engines + heuristics to find phishing web sites
  - 97% true positives with 6% false positives (Final-TF-IDF)
  - 89% true positives with 1% false positives (Final-TF-IDF + heuristics)
- Shifts the problem of identifying phishing sites to a search engine problem

Q&A