CANTINA: A Content-Based Approach to Detecting Phishing Web Sites


CANTINA: A Content-Based Approach to Detecting Phishing Web Sites
  Authors: Yue Zhang, Jason Hong, Lorrie Cranor
  Presented By: Kim Giglia
  CSC 682, 10/7/2008

Introduction
  CANTINA: an automated tool to detect phishing websites
  June 2006: 9,255 unique phishing sites reported
  Estimated cost of phishing websites: $1 to $2.8 billion per year
  Previous studies found only one phishing detection tool with > 60% accuracy

Tools to detect/prevent phishing
  Education
  Tools/marks that show trustworthiness
  One-way password hashes
  Proxies that are browser extensions (PassPet and WebWallet)
  ISP-provided toolbars and services

Two major methods to detect phishing:
  Heuristics: often produce false positives
  Blacklists: labor intensive
  One-time URLs reduce the effectiveness of blacklists

How CANTINA works (without added heuristics)
  1. Calculate the TF-IDF scores of terms on a page
  2. Generate a lexical signature (five terms)
  3. Search for the lexical signature (Google)
  4. Compare the domain name of the page to the top N results (30 appears to be maximal)

What is TF-IDF?
  TF (term frequency): number of times a term appears in a given document
  IDF (inverse document frequency): measure of the importance of a term, based on how rare the term is across the corpus
  A high TF-IDF weight occurs when TF is high and the term appears in few documents of the corpus (that is, IDF is high)
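The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the CANTINA implementation: the slides do not give the exact weighting formula, so the smoothed logarithmic IDF, the function name `tf_idf`, and the toy corpus are all assumptions.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Return {term: tf * idf} for one tokenized document.

    corpus is a list of tokenized documents used to estimate how many
    documents each term appears in (its document frequency).
    """
    tf = Counter(doc_tokens)            # raw term counts in this document
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)
        # Smoothed IDF (an assumption): avoids division by zero when the
        # term does not occur anywhere in the corpus.
        idf = math.log((1 + n_docs) / (1 + df))
        scores[term] = count * idf
    return scores

# Toy corpus (hypothetical): three tiny "pages".
corpus = [
    ["sign", "in", "to", "your", "account"],
    ["your", "account", "has", "news"],
    ["paypal", "sign", "in", "paypal", "account"],
]
scores = tf_idf(corpus[2], corpus)
best_term = max(scores, key=scores.get)
```

In this toy corpus "account" occurs in every document, so its weight drops to zero, while "paypal" is frequent on the page and rare elsewhere, so it ranks first.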

Develop the Lexical Signature
  Take the 5 terms with the highest TF-IDF weights
  Develop the Robust Hyperlink
  Ex: http://dom.com/page.html?ls=t1+t2+t3+t4+t5
  Add the current domain name to the lexical signature
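A rough Python sketch of this slide, under stated assumptions: the TF-IDF weights are already computed and passed in as a dictionary, and the helper names `lexical_signature`, `robust_hyperlink`, and `search_query` are hypothetical, as are the sample weights. The hyperlink helper also assumes the URL has no existing query string.

```python
from urllib.parse import urlencode

def lexical_signature(scores, k=5):
    """Pick the k terms with the highest weights (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def robust_hyperlink(url, signature):
    """Append the signature as an ls= query parameter, terms joined by '+'."""
    return url + "?" + urlencode({"ls": " ".join(signature)})

def search_query(signature, domain):
    """Build the search string: signature terms plus the page's domain name."""
    return " ".join(signature + [domain])

# Hypothetical TF-IDF weights for one page.
scores = {"ebay": 4.2, "bid": 3.1, "auction": 2.9,
          "seller": 2.0, "item": 1.5, "the": 0.1}
sig = lexical_signature(scores)
link = robust_hyperlink("http://dom.com/page.html", sig)
query = search_query(sig, "dom.com")
```

Here `link` follows the `?ls=t1+t2+t3+t4+t5` pattern from the slide, because `urlencode` encodes the spaces between terms as `+`.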

Search for lexical signature
  ZMP (Zero Results Means Phishing): if Google returns no results, the page is treated as a phishing site
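The decision step, including the ZMP rule, might look like the sketch below. It assumes the search results are already available as a list of URLs (no real search API is called), and it compares domains by their last two host labels, which is a simplification that mishandles suffixes such as .co.uk. The function names are hypothetical.

```python
from urllib.parse import urlparse

def registered_domain(url):
    """Naive domain extraction: last two labels of the host name (assumption)."""
    host = (urlparse(url).hostname or "").lower()
    return ".".join(host.split(".")[-2:])

def looks_legitimate(page_url, result_urls, top_n=30):
    """Return True if the page's domain appears among the top N search results."""
    if not result_urls:
        return False  # ZMP: zero results means phishing
    target = registered_domain(page_url)
    return any(registered_domain(u) == target for u in result_urls[:top_n])

# Hypothetical result lists standing in for a search engine response.
results = ["https://example.com/", "http://other.org/help"]
ok = looks_legitimate("http://www.example.com/login", results)
spoof = looks_legitimate("http://evil.test/login", results)
empty = looks_legitimate("http://example.com/", [])
```

A production version would need a public-suffix-aware domain comparison and a real search backend; both are outside what the slides describe.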

Additional Heuristics
  Age of Domain
  Known Images
  Suspicious URLs/Links
  IP Address
  Dots in URL
  Forms
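Two of these heuristics, IP Address and Dots in URL, are simple enough to illustrate in Python. The slide gives no thresholds or rules, so the `max_dots=5` default and the function names below are assumptions for illustration only.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """IP Address heuristic: True if the URL's host is a literal IP address."""
    host = urlparse(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def too_many_dots(url, max_dots=5):
    """Dots in URL heuristic: True if the URL has more than max_dots dots.

    The default threshold is an assumption, not taken from the slides.
    """
    return url.count(".") > max_dots

flags = {
    "ip": host_is_ip("http://192.168.0.1/login"),
    "dots": too_many_dots("http://a.b.c.d.e.f.example.com/"),
}
```

Each function returns a boolean signal; how such signals are weighted and combined with the TF-IDF result is not covered on this slide.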

CANTINA Implementation
  Written in C# using .NET 2003
  800 lines of code and 4 libraries
  Microsoft IE extension
  Document corpus: British National Corpus, with 67,962,112 total words and 9,022 unique words
  Analyzes the text content of the DOM
  Simple user interface: red traffic light

Experimental results Experiment #1:

Experimental results Experiment #2:

Experimental results Experiment #3:

Experimental results Experiment #4:

Limitations
  Doesn’t deal with all JavaScript modifications to pages
  DOM parser sometimes returns wrong text
  Some legitimate sites are composed mostly of images
  Logos in the logo heuristic must be maintained
  No dictionary for other languages
  Time lags in querying Google
  Doesn’t deal with invisible text

Conclusions/Thoughts
  Neat idea, but needs work on performance issues
  Also needs work on heuristics to reduce false positives without reducing effectiveness
  As long as the search engine returns mostly legitimate sites, CANTINA works, but if not…
  Needs work on web pages that change dynamically using JavaScript