A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval
Reporter: 林佳宜, 2015/10/17

Presentation transcript:

References
Xiang, G., and J. I. Hong (2009). A Hybrid Phish Detection Approach by Identity Discovery and Keywords Retrieval. In Proceedings of WWW 2009.

Outline
◦ Introduction
◦ Hybrid detection approach
◦ System architecture
◦ Experimental evaluation
◦ Conclusion

Introduction
Phishing is a significant security threat to the Internet
◦ it causes tremendous economic loss every year
The paper proposes a novel hybrid phish detection method
◦ it combines information extraction (IE) and information retrieval (IR)
The identity-based component detects phishing pages
◦ by directly discovering the inconsistency between their own identity and the identity they are imitating
The keywords-retrieval component
◦ uses IR algorithms that exploit the power of search engines to identify phish

Definition of a phishing webpage
A webpage is treated as phish if it satisfies the following criteria
◦ It impersonates a well-known website by replicating the whole or part of the target site, showing high visual similarity to its target.
◦ It is associated with a domain that is usually unrelated to that of its target website.
◦ It has a login form requesting sensitive information.

Phishing Site of eBay

The approach exploits a few properties
Website brand names usually appear in
◦ the title
◦ the copyright field
The domain keyword is the segment of the domain that represents the brand name
◦ e.g., “Paypal” for “paypal.com”
Phishing webpages are much less likely to be crawled and indexed by major search engines
◦ short-lived nature
◦ few incoming links
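As a rough illustration of the domain keyword idea, the sketch below pulls the label just before the top-level domain out of a URL. The function name `domain_keyword` is hypothetical, and the code assumes a single-label suffix such as "com"; multi-part suffixes like "co.uk" would need a public-suffix list, which the slides do not discuss.

```python
from urllib.parse import urlparse


def domain_keyword(url: str) -> str:
    """Return the label just before the top-level domain, e.g. 'paypal'."""
    # urlparse only fills in the hostname when a scheme or "//" is present.
    if "://" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    # Assumption: the suffix is one label ("com", "org", ...).
    return labels[-2] if len(labels) >= 2 else host


print(domain_keyword("https://www.paypal.com/signin"))
```

The same helper accepts bare domains such as "paypal.com".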

Hybrid approach (1/2)
Consists of
◦ an identity-based detection component
◦ a keywords-retrieval detection component
Requires
◦ no training data
◦ no prior knowledge of phishing signatures or specific implementations
◦ and is therefore able to adapt quickly to the constantly appearing new phishing patterns

Hybrid approach (2/2)
Relies on identity recognition to
◦ find the domain of the page’s declared identity
◦ examine the legitimacy of the webpage by comparing this extracted domain with the page’s own domain
◦ using a query of the form: site:declared brand domain “page domain”
Does not directly match the two domain strings
◦ because some domains are closely related (e.g., company affiliations)
◦ such as “blogger.com” and “blogspot.com”
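The comparison step can be sketched as follows. This is only an illustration: `build_site_query` and `domains_related` are hypothetical names, the `search` argument stands in for whatever search-engine client is available, and reading a non-empty result list as "the two domains are related" is an assumption rather than something the slides spell out.

```python
from typing import Callable, List


def build_site_query(brand_domain: str, page_domain: str) -> str:
    """Build a query of the form: site:<brand domain> "<page domain>"."""
    return f'site:{brand_domain} "{page_domain}"'


def domains_related(
    brand_domain: str,
    page_domain: str,
    search: Callable[[str], List[str]],
) -> bool:
    """Decide whether the page domain looks related to the declared brand domain."""
    if brand_domain.lower() == page_domain.lower():
        return True
    # Assumption: any hit for the site: query counts as evidence of a relation,
    # which lets affiliated domains pass without an exact string match.
    return len(search(build_site_query(brand_domain, page_domain))) > 0


def fake_search(query: str) -> List[str]:
    """Stand-in for a real search engine, used only for this example."""
    known = {'site:blogger.com "blogspot.com"': ["https://www.blogger.com/about"]}
    return known.get(query, [])


print(domains_related("blogger.com", "blogspot.com", fake_search))
```

Passing the search function in as a parameter keeps the sketch independent of any particular search API.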

Named Entity Recognizer
The NE identity recognition module augments the retrieval-based one
◦ in cases where brand names are absent from the title and copyright field
◦ it acts as an auxiliary module to the identity-based component to reduce false positives
It identifies the identity from
◦ the page content
◦ the meta keywords/description tags
It uses the well-known TF-IDF scoring function
◦ to get the top-ranking keywords

TF-IDF Algorithm
Yields a weight that measures how important a word is to a document in a corpus
Term Frequency (TF)
◦ The number of times a given term appears in a specific document
◦ A measure of the importance of the term within that particular document
Inverse Document Frequency (IDF)
◦ A measure of how common a term is across the entire collection of documents
A term has a high TF-IDF weight when it has
◦ a high term frequency in the given document
◦ a low document frequency in the whole collection of documents
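A minimal sketch of the weighting described above, using one common variant (relative term frequency multiplied by log(N / df)). The slides do not say which exact TF-IDF formula the system uses, so the formula, the function names `tf_idf` and `top_keywords`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(doc_tokens: List[str], corpus: List[List[str]]) -> Dict[str, float]:
    """Score every term of one document against a corpus of tokenized documents."""
    n_docs = len(corpus)
    counts = Counter(doc_tokens)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)                    # relative term frequency
        df = sum(1 for doc in corpus if term in doc)    # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0      # rarer terms get a larger idf
        scores[term] = tf * idf
    return scores


def top_keywords(doc_tokens: List[str], corpus: List[List[str]], k: int = 5) -> List[str]:
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    scores = tf_idf(doc_tokens, corpus)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


corpus = [
    ["paypal", "login", "account"],
    ["login", "news"],
    ["login", "weather"],
]
print(top_keywords(corpus[0], corpus, k=2))
```

In this toy corpus "login" appears in every document, so its weight is zero, while terms unique to the first document rank highest.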

Login form detection (1/2)
The HTML DOM is used to identify login forms
Login forms on a page are characterized by three properties
◦ FORM tags
◦ INPUT tags
◦ login keywords such as “password”

Login form detection (2/2)
The following algorithm is designed to declare the existence of a login form
Case 1: form tags, input tags and login keywords all appear in the DOM
◦ return true if all three are found
Case 2: form and input tags are found, but login-related keywords exist outside the form
◦ involves searching for the keyword “search”
◦ return true if a match is found
Case 3: forms and inputs are detected, but phishers put login keywords in images and refrain from using text to avoid being detected
◦ return true if no text is found and only images exist
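The three cases can be approximated with Python's standard `html.parser`, as in the simplified sketch below. Several details are assumptions, not the authors' implementation: the keyword list is hypothetical, all forms on the page are treated as one, the whole page outside the form is scanned in case 2, and a form that mentions "search" is skipped in case 2 on the guess that the keyword is used to filter out search boxes.

```python
from html.parser import HTMLParser

LOGIN_KEYWORDS = ("password", "login", "sign in", "username")  # hypothetical list


class _FormScanner(HTMLParser):
    """Collects form/input presence, text inside and outside forms, and form images."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_form = False
        self.has_input = False
        self.form_text = []     # visible text inside <form>
        self.form_attrs = []    # type/name attribute values of inputs
        self.outside_text = []  # visible text outside <form>
        self.form_images = 0

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
            self.has_form = True
        elif tag == "input" and self.in_form:
            self.has_input = True
            self.form_attrs.extend(v for k, v in attrs if k in ("type", "name") and v)
        elif tag == "img" and self.in_form:
            self.form_images += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        if data.strip():
            (self.form_text if self.in_form else self.outside_text).append(data)


def has_login_form(html: str) -> bool:
    scanner = _FormScanner()
    scanner.feed(html)
    if not (scanner.has_form and scanner.has_input):
        return False
    inside = " ".join(scanner.form_text + scanner.form_attrs).lower()
    outside = " ".join(scanner.outside_text).lower()
    # Case 1: form, input and a login keyword all appear together.
    if any(k in inside for k in LOGIN_KEYWORDS):
        return True
    # Case 2: login keywords sit outside the form (assumed: skip search forms).
    if "search" not in inside and any(k in outside for k in LOGIN_KEYWORDS):
        return True
    # Case 3: the form shows no text at all, only images.
    return not scanner.form_text and scanner.form_images > 0


print(has_login_form('<form><input type="password"></form>'))
```

A real implementation would walk the DOM per form rather than flattening the page, which this sketch deliberately avoids for brevity.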

System architecture

Two strategies
To map a brand name accurately to its domain, two strategies are defined for selecting domains when domain-query matches occur.
Strategy I:
◦ Evaluate domain-query matches among the top 5 search results of Google and Yahoo
◦ If both search engines have such matches and the domain of the No. 1 match from each side coincides, take it as a candidate domain of the brand corresponding to the query
◦ If only one search engine has matches, take its No. 1 domain as a candidate brand domain
Strategy II:
◦ Just take the two branches corresponding to the italicized part of Strategy I.
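Strategy I, as far as the slide states it, could be sketched like this. The definition of a "domain-query match" used here (the domain keyword occurs in the query with spaces removed) is an assumption, as are all function names. The slide does not say what happens when both engines have matches that disagree, so the sketch returns `None` in that case. Strategy II is not sketched because it depends on italics that are not visible in the transcript.

```python
from typing import List, Optional


def _is_match(query: str, domain: str) -> bool:
    # Assumption: a domain "matches" when its keyword occurs in the query.
    labels = domain.lower().split(".")
    keyword = labels[-2] if len(labels) >= 2 else labels[0]
    return keyword in query.lower().replace(" ", "")


def first_match(query: str, result_domains: List[str], top_n: int = 5) -> Optional[str]:
    """Return the first matching domain among the top_n results, if any."""
    for domain in result_domains[:top_n]:
        if _is_match(query, domain):
            return domain.lower()
    return None


def candidate_domain_strategy_one(
    query: str, google_domains: List[str], yahoo_domains: List[str]
) -> Optional[str]:
    g = first_match(query, google_domains)
    y = first_match(query, yahoo_domains)
    if g and y:
        # Both engines have matches: accept only if the No. 1 matches coincide.
        return g if g == y else None
    # Only one engine (or neither) has a match: take whichever exists.
    return g or y


print(candidate_domain_strategy_one(
    "PayPal", ["en.wikipedia.org", "paypal.com"], ["paypal.com"]
))
```

The result lists are plain domain strings here, so any search client can be adapted by extracting domains from its results first.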

Data and Usage
The webpage collection consists of phishing cases from one source and good webpages from six sources
Phishing pages
◦ a total of 7906 phishing webpages collected from PhishTank
Good pages came from
◦ Alexa.com
◦ Google: pages with the keywords “signin” and “login”
◦ 3Sharp
◦ Yahoo: the directory’s bank category, and Yahoo misc pages
◦ prominent pages

Detecting Login Forms
Successfully detected 99.82% of phishing pages with login forms
The remaining 0.18% (14 in absolute number) of phishing pages
◦ either do not have a login form
◦ use login keywords not in the keyword list, such as “serial key”
◦ or organize the form/input tags in a way the method misses
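The two percentages are consistent with each other if the 14 misses are counted against the 7906 PhishTank pages from the data slide. That denominator is an assumption, since this slide does not state it; the check below only confirms the arithmetic under that assumption.

```python
total_phish = 7906   # phishing pages collected (from the data slide)
missed = 14          # pages where no login form was detected

missed_pct = round(100 * missed / total_phish, 2)
detected_pct = round(100 * (total_phish - missed) / total_phish, 2)

print(missed_pct, detected_pct)
```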

Identity-based Detection under Strategy I (1/2)
Experimented with five approaches, i.e., detection by
◦ title
◦ copyright
◦ TF-IDF
◦ title + copyright + NE
◦ a full-blown method with a combination of the four

Identity-based Detection under Strategy I (2/2)
All individual detection algorithms have a low false positive rate (FP < 1.5%)

Identity-based Detection across Strategies
Considering the different sizes of the phish (7906) and legitimate (3543) corpora

Evaluation with Other TF-IDF Approaches
Zhang et al. proposed CANTINA
◦ a content-based method
◦ compared against two state-of-the-art toolbars, SpoofGuard and Netcraft

Conclusion
Presented the design and evaluation of a hybrid phish detection method
Achieved a true positive rate of 90.06% with a false positive rate of 1.95%
Requires neither existing phishing signatures nor training data

Questions