CANTINA: A Content-Based Approach to Detecting Phishing Web Sites. Reporter: Gia-Nan Gao. Advisor: Chin-Laung Lei. 2010/6/7.

Presentation transcript:

1 CANTINA: A Content-Based Approach to Detecting Phishing Web Sites
Reporter: Gia-Nan Gao
Advisor: Chin-Laung Lei
2010/6/7

2 Reference
Y. Zhang, J. Hong, and L. Cranor, “Cantina: A content-based approach to detecting phishing web sites,” in Proceedings of the International World Wide Web Conference (WWW), 2007.

3 Outline
Introduction
Automated Detection of Phishing
A Content-based Approach for Detecting Phishing Web Sites
Evaluation
Conclusion

4 Introduction
Phishing
◦ A kind of attack in which victims are tricked by spoofed e-mails and fraudulent web sites into giving up personal information
How many phishing sites are there?
◦ 9,255 unique phishing sites were reported in June 2006 alone
How much does phishing cost each year?
◦ $1 billion to $2.8 billion per year

5 Automated Detection of Phishing
Use heuristics to judge whether a page has phishing characteristics
◦ Can detect phishing attacks as soon as they are launched
◦ Attackers may be able to design their attacks to avoid heuristic detection
◦ Often produce false positives
Use a blacklist that lists reported phishing URLs
◦ Higher accuracy
◦ Require human intervention and verification

6 CANTINA: Content-based Approach
TF-IDF algorithm
Robust Hyperlinks
Adapting TF-IDF for detecting phishing
A set of auxiliary heuristics

7 TF-IDF Algorithm
Yields a weight that measures how important a word is to a document in a corpus
Term Frequency (TF)
◦ The number of times a given term appears in a specific document
◦ Measures the importance of the term within that document
Inverse Document Frequency (IDF)
◦ Derived from how common a term is across an entire collection of documents; the rarer the term, the higher its IDF
A term has a high TF-IDF weight when it has
◦ A high term frequency in the given document
◦ A low document frequency in the whole collection of documents
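The weighting described on this slide can be sketched as follows. This is a minimal illustration that assumes the common formulation weight = tf × log(N / df); the slide does not say which TF-IDF variant CANTINA uses, and the function name `tf_idf` and the toy corpus are made up for the example.

```python
import math
from collections import Counter


def tf_idf(document, corpus):
    """Return {term: weight} for one tokenized document.

    Assumption: weight = tf * log(N / df), one of several common variants.
    `corpus` is a list of tokenized documents and must include `document`,
    so that every term has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    # Term frequency: raw count of each term in the target document.
    tf = Counter(document)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}


# Toy corpus (illustrative only).
corpus = [
    ["ebay", "sign", "in", "ebay", "account"],
    ["sign", "in", "to", "your", "account"],
    ["news", "in", "brief"],
]
weights = tf_idf(corpus[0], corpus)
```

In this toy corpus "ebay" is frequent in the first document and absent elsewhere, so it receives the largest weight, while "in" occurs in every document and receives a weight of zero.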

8 Robust Hyperlinks
Overcome the problem of broken links
Basic idea
◦ Add a small number of well-chosen terms, called a lexical signature, to URLs
Create signatures
◦ Calculate the TF-IDF value for each word in a document, then select the words with the highest values
Lexical signatures of about five terms are sufficient to identify a web resource virtually uniquely
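As a rough sketch of the idea on this slide, the snippet below picks the top-weighted terms as a lexical signature and appends them to a URL. The query parameter name `lexical-signature`, the helper names, and the example weights are assumptions for illustration; the slide does not specify the exact URL syntax.

```python
from urllib.parse import urlencode


def lexical_signature(weights, k=5):
    """Pick the k terms with the highest TF-IDF weight (ties broken alphabetically)."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def robust_url(url, signature):
    """Append the signature to the URL as a query parameter.

    Assumption: the parameter is named "lexical-signature"; terms are joined
    with spaces, which urlencode renders as "+".
    """
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode({"lexical-signature": " ".join(signature)})


# Hypothetical TF-IDF weights for one page.
example_weights = {
    "cantina": 3.2,
    "phishing": 2.9,
    "tfidf": 2.5,
    "signature": 2.1,
    "hyperlink": 1.8,
    "the": 0.0,
}
signature = lexical_signature(example_weights)
link = robust_url("http://example.com/paper.html", signature)
```

If the page later moves, the signature terms carried in the link can be submitted to a search engine to relocate it.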

9 How CANTINA Works (1/2)
Given a web page, calculate the TF-IDF score of each term on that page
Generate a lexical signature by taking the five terms with the highest TF-IDF weights
Feed this lexical signature to a search engine, which in this case is Google
If the domain name of the current web page matches the domain name of any of the top N search results, the page is considered a legitimate web site; otherwise, it is considered a phishing site
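The final matching step can be sketched as below. To stay self-contained, the search results are passed in as a plain list of URLs rather than fetched from a search engine. The sketch compares full hostnames, which is a simplification: the slide does not say how the domain name is extracted, and the helper names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased hostname of a URL, or "" if it has none."""
    return (urlsplit(url).hostname or "").lower()


def looks_legitimate(page_url, result_urls, top_n=30):
    """True if the page's hostname appears among the top_n search results.

    Simplification (assumption): exact hostname comparison, so that
    "www.example.com" and "login.example.com" count as different domains.
    """
    page_host = host_of(page_url)
    if not page_host:
        return False
    return any(host_of(result) == page_host for result in result_urls[:top_n])


# Hypothetical search results for a signature taken from an eBay sign-in page.
results = [
    "https://www.ebay.com/",
    "https://signin.ebay.com/ws/eBayISAPI.dll",
    "https://en.wikipedia.org/wiki/EBay",
]
genuine = looks_legitimate("https://signin.ebay.com/login", results)
spoofed = looks_legitimate("http://ebay.example-phish.test/login", results)
```

Here `genuine` is true because the page's hostname appears in the results, and `spoofed` is false because the look-alike host does not.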

10 How CANTINA Works (2/2)
Assumption
◦ Google indexes the vast majority of legitimate web sites, and legitimate sites are ranked higher than phishing sites
Two heuristics
◦ Domain name: add the current domain name to the lexical signature
◦ Zero results Means Phishing (ZMP): if Google fails to return any results, the suspected site is labeled as phishing
Example: eBay and its phishing site
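The two heuristics on this slide can be sketched as optional switches on the query and on the decision. The function names and labels are illustrative. The slide does not say how zero results are treated when ZMP is turned off, so this sketch returns "unknown" in that case, which is an assumption.

```python
def build_query(signature, page_domain, add_domain=True):
    """Join the signature terms into a search query.

    With add_domain=True the page's own domain name is appended
    (the "domain name" heuristic).
    """
    terms = list(signature)
    if add_domain and page_domain not in terms:
        terms.append(page_domain)
    return " ".join(terms)


def classify(page_domain, result_domains, zero_means_phishing=True):
    """Label a page from the domains of the returned search results."""
    if not result_domains:
        # ZMP heuristic: an empty result list is treated as phishing.
        # Assumption: without ZMP the outcome is left undecided.
        return "phishing" if zero_means_phishing else "unknown"
    return "legitimate" if page_domain in result_domains else "phishing"


query = build_query(["ebay", "user", "sign", "help", "forgot"], "ebay.com")
no_results = classify("ebay.com", [])
matched = classify("ebay.com", ["ebay.com", "paypal.com"])
```

Adding the domain name to the query makes it more likely that a legitimate page finds itself in the results, while ZMP handles pages whose signature matches nothing that the search engine has indexed.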

11 Example: eBay

12 Phishing Site of eBay

13 Legitimate Site of eBay

14 Auxiliary Heuristics

15 Evaluation
Four experiments
◦ Evaluation of TF-IDF
◦ Evaluation of heuristics
◦ Evaluation of CANTINA
◦ Evaluation of CANTINA using URLs gathered from e-mail
Two metrics
◦ True positives (correctly labeling a phishing site as phishing, higher is better)
◦ False positives (incorrectly labeling a legitimate site as phishing, lower is better)
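The two metrics can be computed as rates from labeled test data, as in the sketch below. The function name and the toy labels are made up for illustration, and the sketch assumes the test set contains at least one phishing page and one legitimate page.

```python
def rates(is_phishing, flagged):
    """Return (true_positive_rate, false_positive_rate).

    is_phishing: ground-truth booleans, True for a phishing page.
    flagged:     detector output, True if the page was labeled as phishing.
    Assumption: both classes are present, so neither denominator is zero.
    """
    pairs = list(zip(is_phishing, flagged))
    true_pos = sum(1 for truth, guess in pairs if truth and guess)
    false_pos = sum(1 for truth, guess in pairs if not truth and guess)
    n_phishing = sum(1 for truth, _ in pairs if truth)
    n_legitimate = len(pairs) - n_phishing
    return true_pos / n_phishing, false_pos / n_legitimate


# Toy data: two phishing pages and three legitimate pages.
truth = [True, True, False, False, False]
guess = [True, False, True, False, False]
tp_rate, fp_rate = rates(truth, guess)
```

In the toy data one of the two phishing pages is caught and one of the three legitimate pages is wrongly flagged.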

16 Experiment 1 – Evaluation of TF-IDF (1/3)
Basic TF-IDF – Calculate the lexical signature based on the top 5 terms, submit that to Google, and check if the domain name of the page in question matches any of the top 30 results
Basic TF-IDF+domain – Same as Basic TF-IDF, except that the domain name of the page in question is added to the lexical signature
Basic TF-IDF+ZMP – Same as Basic TF-IDF, except that zero search results means that the page in question is labeled as a phishing site (ZMP is “zero means phishing”)
Basic TF-IDF+domain+ZMP – A combination of the two variants above. This combination turned out to have the best results, and is also called Final-TF-IDF in later sections

17 Experiment 1 – Evaluation of TF-IDF (2/3)
100 phishing URLs from PhishTank.com
100 legitimate URLs from a list of 500 used in 3Sharp’s study of anti-phishing toolbars

18 Experiment 1 – Evaluation of TF-IDF (3/3)

19 Experiment 2 – Evaluation of Heuristics (1/2)
Determine the best weights for these heuristics
If S = 1, the page is labeled as a legitimate page; if S = -1, it is labeled as a phishing site
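The formula that defines S is not part of this transcript. One plausible reading, shown here purely as an assumption, is a weighted vote: each heuristic returns +1 (legitimate) or -1 (phishing), the votes are multiplied by their weights and summed, and the sign of the sum gives S. The function name, the example weights, and the handling of a zero sum are all illustrative.

```python
def combined_score(votes, weights):
    """Return S = +1 (legitimate) or -1 (phishing) from weighted votes.

    Assumptions: each vote is +1 or -1, the combination is a plain
    weighted sum, and a sum of exactly zero counts as phishing.
    """
    if len(votes) != len(weights):
        raise ValueError("votes and weights must have the same length")
    total = sum(weight * vote for vote, weight in zip(votes, weights))
    return 1 if total > 0 else -1


# Hypothetical weights for three heuristics.
example_weights = [0.5, 0.3, 0.2]
s_mostly_legit = combined_score([1, 1, -1], example_weights)
s_mostly_phish = combined_score([-1, 1, -1], example_weights)
```

Under this reading, tuning the weights amounts to deciding how much each heuristic's vote counts toward the final label.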

20 Experiment 2 – Evaluation of Heuristics (2/2)

21 Experiment 3 – Evaluation of CANTINA (1/2)
Comparison
◦ Final-TF-IDF, Final-TF-IDF+heuristics, SpoofGuard, and Netcraft
SpoofGuard
◦ Relies entirely on heuristics
Netcraft
◦ Uses a combination of heuristics and a blacklist
100 phishing URLs with unique domains from PhishTank
100 legitimate URLs using the following strategy
◦ Select the login pages of 35 sites that are often attacked by phishers
◦ Select the 35 top pages from Alexa Web Search
◦ Select 30 random pages from http://random.yahoo.com/fast/ryl and manually verify that they are legitimate

22 Experiment 3 – Evaluation of CANTINA (2/2)

23 Experiment 4 – Evaluation of CANTINA Using URLs Gathered from E-mail (1/2)
Evaluate CANTINA using URLs gathered from users’ actual e-mail
Gather 3038 unique URLs from the messages, of which only 2519 were active
Manually label the active URLs as “phishing,” “spam,” or “legitimate”
◦ Phishing: pages that impersonate known brands and ask for personal data (19)
◦ Spam: pages selling unsolicited products or services (388)
◦ Legitimate: all other URLs (2100)

24 Experiment 4 – Evaluation of CANTINA Using URLs Gathered from E-mail (2/2)

25 Conclusion
A content-based approach for detecting phishing web sites
Pure TF-IDF approach
◦ Can catch 97% of phishing sites with about 6% false positives
Final TF-IDF approach
◦ Can catch about 90% of phishing sites with only 1% false positives