1 Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD) Speaker Po-Jiu Wang Institute of Information Science.

Presentation transcript:

1 Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover’s Distance (EMD)
Speaker: Po-Jiu Wang, Institute of Information Science, Academia Sinica
Author: Anthony Y. Fu, Department of Computer Science, City University of Hong Kong
IEEE 2006

2 Outline
• What is phishing
• Various phishing techniques
• Previous anti-phishing works
• Evaluating webpage distance with EMD
  • What is EMD, and its advantages
  • Color and its coordinate distance with EMD
• Conclusion and tentative work to do

3 What is phishing
Phishing is a criminal trick for stealing personal information by luring people into visiting a fake webpage. How are people lured?
• Channels: phishing email, BBS, chat rooms, etc.
• Spoofed pretexts: free gifts, identity confirmation, etc.

4 Various phishing techniques The most straightforward way for a phisher to spoof people is to make the appearance of webpage links and webpages similar to the real ones.

5 Various phishing techniques (Link-based phishing obfuscation)
Link-based phishing obfuscation can be carried out in the four ways below:
• Adding a suffix to the domain name of the URL. E.g., revise to
• Using an actual link that differs from the visible link. E.g., the HTML line: ;

6 Various Phishing Techniques (Link-based phishing obfuscation 1)
• Exploiting a bug in the real website to redirect to other webpages. E.g., the bug of the eBay website: Domain&DomainUrl=PHISHINGLINK can direct you to any specified PHISHINGLINK;
• Replacing characters in the real link with similar-looking ones. E.g., replace “I” (uppercase “i”) with “l” (lowercase “L”) or “1” (the digit one), such as to
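Two of the link-based tricks above (a visible link that differs from the actual link, and similar-looking characters) can be checked mechanically. The following is a minimal sketch, not part of the presented system: the function names, the lookalike table, and the `.example` host names are all hypothetical, and the table covers only the substitutions named on the slide.

```python
from urllib.parse import urlparse

# Hypothetical lookalike table: only the substitutions named on the slide
# (uppercase "I" and the digit "1" both imitating lowercase "l").
LOOKALIKES = str.maketrans({"1": "l", "I": "l"})

def host_of(url: str) -> str:
    """Return the host part of a URL, keeping its original letter case."""
    if "://" not in url:
        url = "http://" + url  # urlparse only fills netloc when a scheme is present
    return urlparse(url).netloc

def skeleton(host: str) -> str:
    """Collapse lookalike characters first, then lowercase."""
    return host.translate(LOOKALIKES).lower()

def visible_link_mismatch(visible_text: str, href: str) -> bool:
    """True when the visible text looks like a URL but names a different host than href."""
    shown = host_of(visible_text.strip())
    if "." not in shown or " " in shown:  # visible text is not URL-like
        return False
    return shown.lower() != host_of(href).lower()

def is_lookalike(host: str, protected_host: str) -> bool:
    """True when host differs from protected_host only by lookalike characters."""
    return (host.lower() != protected_host.lower()
            and skeleton(host) == skeleton(protected_host))
```

A check like this only covers the link-based tricks; it says nothing about pages that imitate the real site visually, which is what the EMD approach targets.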

7 Various Phishing Techniques (webpage-based obfuscation)
Webpage-based obfuscation can be carried out in the three basic ways below:
• Using a webpage downloaded from the real website so that the phishing webpage looks and behaves exactly like the real one;

8 Various Phishing Techniques (webpage-based obfuscation 1)
• Using a script or a browser add-in to cover the address bar so that users believe they have entered the correct website;
• Using visual content (e.g., images, Flash, video) rather than HTML to evade HTML-based phishing detection.

9 Previous Anti-Phishing Works
Anti-spamming
• Phishing is spam. Phishers harvest addresses and broadcast to the potential victims.
Human-aided
• Banks employ a group of people to monitor phishing activities, e.g., HSBC.

10 Previous Anti-Phishing Works (1)
Duplicate-document detection approaches focus on plain-text documents and use pure text features in the similarity measure.

11 Motivation Phishing Web pages always have high visual similarity with the real Web pages. An effective approach called image-based EMD is proposed to calculate the visual similarity of Web pages.

12 Evaluating webpage distance with EMD
EMD is the Earth Mover’s Distance, and it is based on the well-known transportation problem.
• Suppose we have $m$ producers $P=\{(p_1,w_{p_1}),(p_2,w_{p_2}),\dots,(p_m,w_{p_m})\}$
• and $n$ customers $C=\{(c_1,w_{c_1}),(c_2,w_{c_2}),\dots,(c_n,w_{c_n})\}$
• A distance matrix $D=[d_{ij}]$ is given

13 Evaluating webpage distance with EMD (transportation fee)
The task is to find a flow matrix $F=[f_{ij}]$ whose entries indicate the amount of product to be moved from producer $p_i$ to customer $c_j$.

14 Evaluating webpage distance with EMD (total cost of transportation fee) The total cost of transportation fee can be represented as: ST:

15 Evaluating webpage distance with EMD (final equation of EMD) The EMD can be represented as:
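The formulas for the two slides above are not reproduced in the text. As a reference sketch, the standard Earth Mover's Distance formulation of the transportation problem is given below, written with the symbols of slides 12 and 13 ($w_{p_i}$, $w_{c_j}$, $d_{ij}$, $f_{ij}$). It is assumed, not confirmed, that the slides used this form; the name $\mathrm{WORK}$ is introduced here for illustration.

```latex
% Total transportation cost to be minimized over the flow matrix F
\min_{F}\ \mathrm{WORK}(P,C,F)=\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\,d_{ij}

% Subject to (ST):
f_{ij}\ \ge\ 0,\qquad 1\le i\le m,\ 1\le j\le n
\sum_{j=1}^{n} f_{ij}\ \le\ w_{p_i},\qquad 1\le i\le m
\sum_{i=1}^{m} f_{ij}\ \le\ w_{c_j},\qquad 1\le j\le n
\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\ =\ \min\!\Big(\sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{c_j}\Big)

% EMD: the optimal cost normalized by the total flow
\mathrm{EMD}(P,C)=\frac{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}\,d_{ij}}{\sum_{i=1}^{m}\sum_{j=1}^{n} f_{ij}}
```

The last constraint forces as much weight as possible to be moved, and the normalization by total flow is what allows partial matches when the two signatures have different total weights.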

16 Advantage of EMD
• Represents problems involving multi-featured signatures
• Allows for partial matches in a very natural way
• Fits cognitive distance evaluation

17 Color and its coordinate distance with EMD (Preprocess image data)
Preprocess image data
• Compress the images to 10×10 pixels. Experiments show that image size compression heavily reduces the calculation time without reducing precision and recall.
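The size compression step can be sketched as follows. This is a minimal illustration using nearest-neighbour sampling on a plain list-of-rows image; the slides do not say which resampling method was used, so the method and the function name `downscale` are assumptions.

```python
def downscale(pixels, w, h):
    """Nearest-neighbour resize of a row-major grid (list of rows) to w x h.

    Assumption: the resampling method is chosen only for illustration.
    """
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // h][x * src_w // w] for x in range(w)]
        for y in range(h)
    ]

# A 4x4 grid reduced to 2x2 keeps every second pixel in each direction.
grid = [[row * 4 + col for col in range(4)] for row in range(4)]
small = downscale(grid, 2, 2)
```

Fewer pixels mean fewer distinct colors and smaller signatures, which is where the reduction in calculation time comes from.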

18 The calculation of the distance of pixel color and coordinate
• Get the signatures of webpage1 and webpage2 using pixel color and coordinate.
• Calculate $D=[d_{ij}]$, where $d_{ij}=\mathrm{Distance}(\mathrm{Color}(\mathrm{pixel}_i),\mathrm{Color}(\mathrm{pixel}_j),\mathrm{Coordinate}(\mathrm{pixel}_i),\mathrm{Coordinate}(\mathrm{pixel}_j))$.
• EMDColorAndCordinate = EMDDist(Signature1, Signature2, D)

19 The improved color space
The color of each pixel in the resized images is represented using the ARGB (alpha, red, green, and blue) scheme with 4 bytes (32 bits). To reduce the number of colors, a degraded color space is used, controlled by the Color Degrading Factor (CDF). Thus, the degraded color space contains $(2^8/\mathrm{CDF})^4$ colors.
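As a small worked example of the color degrading step, the sketch below assumes that degrading means integer division of each 8-bit channel by CDF, which yields $2^8/\mathrm{CDF}$ levels per channel. The function names are hypothetical.

```python
def degrade(argb, cdf=32):
    """Map each 8-bit channel to a bucket index (assumed: integer division by CDF)."""
    return tuple(channel // cdf for channel in argb)

def degraded_space_size(cdf=32):
    """Number of colors in the degraded space: (2^8 / CDF)^4."""
    return (2 ** 8 // cdf) ** 4

print(degrade((255, 200, 100, 0)))  # (7, 6, 3, 0)
print(degraded_space_size(32))      # 4096
```

With CDF = 32 the roughly 4.3 billion possible ARGB values collapse to 4,096 degraded colors, so pixels with slightly different colors fall into the same bucket.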

20 The centroid of degraded color space
The centroid of each degraded color is calculated using:
$C_{dc}=\frac{1}{N_{dc}}\sum_{i=1}^{N_{dc}} C_{dc,i}$
where $C_{dc}$ is the centroid of degraded color $dc$, $C_{dc,i}$ is the coordinates of the $i$th pixel that has degraded color $dc$, and $N_{dc}$ is the total number of pixels that have degraded color $dc$.
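The centroid computation can be combined with color degrading to build a signature for one image. The sketch below makes several assumptions that are not stated on the slides: degrading is integer division by CDF, the weight of a feature is the pixel count of its degraded color, and only the `max_colors` most frequent colors are kept. All names are hypothetical.

```python
from collections import defaultdict

def build_signature(pixels, cdf=32, max_colors=20):
    """Build [((degraded_color, centroid), weight), ...] from a grid of ARGB tuples.

    Assumptions: weight = number of pixels with that degraded color, and only
    the max_colors most frequent degraded colors are kept.
    """
    coords = defaultdict(list)
    for y, row in enumerate(pixels):
        for x, argb in enumerate(row):
            degraded = tuple(channel // cdf for channel in argb)
            coords[degraded].append((x, y))

    signature = []
    for degraded, points in coords.items():
        n = len(points)
        # Centroid: the mean coordinate of all pixels sharing this degraded color.
        centroid = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
        signature.append(((degraded, centroid), n))

    signature.sort(key=lambda item: item[1], reverse=True)
    return signature[:max_colors]

WHITE, BLACK = (255, 255, 255, 255), (255, 0, 0, 0)
sig = build_signature([[WHITE, WHITE], [WHITE, BLACK]])
```

In this 2×2 example the white feature has weight 3 with its centroid near the top-left corner, and the black feature has weight 1 with its centroid at the bottom-right pixel.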

21 Computing visual similarity from EMD
First, the normalized Euclidean distance of the degraded ARGB colors is calculated, and then the normalized Euclidean distance of the centroids is calculated.

22 The maximum color distance
Each feature pairs a degraded ARGB color with its centroid. The maximum color distance $MD_{color}$ is the largest possible distance between two degraded colors.

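23 The normalized color distance
The normalized color distance $ND_{color}$ is defined as the Euclidean distance between the two degraded ARGB colors divided by the maximum color distance.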

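24 The normalized centroid distance
The maximum centroid distance is $MD_{centroid}=\sqrt{w^2+h^2}$, where $w$ and $h$ are the width and height of the resized images, respectively. The normalized centroid distance $ND_{centroid}$ is defined as the Euclidean distance between the two centroids divided by $MD_{centroid}$.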

25 Final equation of EMD
The two distances are added up with weights $p$ and $q$, respectively, to form the feature distance $p\cdot ND_{color}+q\cdot ND_{centroid}$, where $p+q=1$.
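The feature distance described on slides 21 to 25 can be sketched as below. Two points are assumptions rather than facts from the slides: degraded channels are bucket indices in the range 0 to 2^8/CDF − 1, so the maximum color distance is taken between the all-zero and all-maximum colors, and the maximum centroid distance is taken as the image diagonal. The function name is hypothetical.

```python
import math

def feature_distance(f1, f2, w, h, cdf=32, p=0.5, q=0.5):
    """Weighted sum of normalized color distance and normalized centroid distance.

    Each feature is (degraded_color, centroid). Assumes cdf < 256 so that there
    is more than one level per channel.
    """
    (color1, centroid1), (color2, centroid2) = f1, f2
    levels = 2 ** 8 // cdf
    # Assumed maximum color distance: from (0,0,0,0) to the largest bucket index.
    md_color = math.dist((0,) * 4, (levels - 1,) * 4)
    # Assumed maximum centroid distance: the diagonal of the resized image.
    md_centroid = math.hypot(w, h)
    nd_color = math.dist(color1, color2) / md_color
    nd_centroid = math.dist(centroid1, centroid2) / md_centroid
    return p * nd_color + q * nd_centroid
```

Evaluating this function for every pair of features from two signatures gives the distance matrix that the EMD computation takes as input; because both terms lie in [0, 1] and p + q = 1, every entry also lies in [0, 1].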

26 Computing EMD-based visual similarity of two images
The visual similarity of two images is computed from their EMD, with a parameter that acts as the amplifier of visual similarity.
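The similarity formula itself is not reproduced on this slide. One form consistent with an "amplifier" parameter is similarity = 1 − EMD^α for an EMD in [0, 1]; the sketch below uses that form as an assumption, with the hypothetical name `alpha` for the amplifier and 0.5 as an illustrative default.

```python
def visual_similarity(emd, alpha=0.5):
    """Assumed form: 1 - EMD**alpha, with EMD in [0, 1] and alpha > 0 as the amplifier."""
    if not 0.0 <= emd <= 1.0:
        raise ValueError("emd is expected to lie in [0, 1]")
    return 1.0 - emd ** alpha

print(visual_similarity(0.0))        # 1.0 (identical signatures)
print(visual_similarity(0.25, 0.5))  # 0.5
```

Under this assumed form, an amplifier below 1 spreads out small EMD values, so nearly identical pages separate more clearly from merely similar ones.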

27 An improved adjusted threshold for classification
A separate threshold is used for each protected web page to classify a web page as either a phishing web page or a normal one. $t_i$ denotes the threshold of the $i$th protected web page.

28 Two types of misclassifications
False alarm
• The visual similarity is larger than or equal to t but, in fact, the web page is not a phishing web page (false positive).
Missing
• The visual similarity is less than t but, in fact, the web page is a phishing one (false negative).
$VSS_i$ correlates to two accessory parameters: the number of false alarms and the number of false negatives.
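The two error types can be counted directly from the definitions above. This is a minimal sketch with made-up similarity values and labels; the function name is hypothetical.

```python
def count_errors(similarities, is_phishing, t):
    """Return (false_alarms, misses) for one protected page with threshold t."""
    pairs = list(zip(similarities, is_phishing))
    # False alarm: similarity reaches t, but the page is not phishing.
    false_alarms = sum(1 for vs, phishing in pairs if vs >= t and not phishing)
    # Miss: similarity stays below t, but the page is phishing.
    misses = sum(1 for vs, phishing in pairs if vs < t and phishing)
    return false_alarms, misses

# Illustrative values only, not experimental data.
print(count_errors([0.95, 0.50, 0.92, 0.30], [True, False, False, True], t=0.9))  # (1, 1)
```

Lowering t turns misses into detections at the price of more false alarms, which is the trade-off the per-page thresholds are meant to balance.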

29 The way to classify phishing page
When a suspected web page arrives, its visual similarity vector (one similarity per protected web page) is computed, and the classification result is obtained using the following equation:
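The classification equation is not reproduced in the text. A minimal sketch consistent with slide 28 (a page is flagged when its similarity to a protected page is larger than or equal to that page's threshold) is shown below; the function name and the choice to return every matching protected page are assumptions.

```python
def classify(similarities, thresholds):
    """Return the indices of protected pages whose threshold is reached.

    An empty list means the suspected page is treated as a normal page.
    """
    if len(similarities) != len(thresholds):
        raise ValueError("one similarity is needed per protected page")
    return [
        i for i, (vs, t) in enumerate(zip(similarities, thresholds))
        if vs >= t
    ]

# Illustrative values only: the page imitates protected page 1.
print(classify([0.20, 0.95, 0.40], [0.90, 0.90, 0.90]))  # [1]
```

Returning the matching indices, rather than a single yes/no answer, also tells the user which protected page the suspected page appears to imitate.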

30 Experiment configuration of phishing detection performance
• 10,272 homepages are selected from the web.
• 9 phishing web pages target 8 real protected web pages.
• The 10,272 + 9 web pages are mixed together to form the Suspected Webpage Set.
• 1,000 web pages randomly selected from the 10,272 are combined with the 9 phishing web pages to form the Training Webpage Set.

31 Train a threshold vector
We use the Training Webpage Set to train a threshold vector, with one threshold (T) per protected webpage:
• real-Bank of Oklahoma - Online
• real-ebay
• real-eBay
• real-ICBC(Asia)
• real-Key Bank
• real-us bank
• real-Washington Mutual
• real-Wells Fargo Sign On

32 Classification precision, phishing recall, and false alarm list ( = 0, 9281 Suspected Web Pages)

33 Reduce false negative possibilities !! Classification precision, phishing recall, and false alarm list ( = 0.005, 9281 Suspected Web Pages)

34 Phishing detection performance of image-based EMD There are 65 false alarms

35 Phishing detection performance of HTML/DOM-based EMD There are 849 false alarms

36 Phishing detection performance of similarity assessment-based EMD There are 697 false alarms

37 Experiment results
The threshold vector is used to classify a suspected webpage. Reducing the possibility of false negatives requires a necessary sacrifice. The parameters were set empirically by tuning: w = h = 100, = 0.5, $|S_s|$ = 20, p = q = 0.5, and CDF = 32.

38 The number of ground truth web pages for each protected web page

39 The configuration of tuning the parameters
Take N as the sample number for each protected web page. If a web page among the N sampled web pages is in the corresponding ground truth group, it is counted as a correctly detected similar web page.

40 Tuning the parameters (w and h) We have four configuration options (w=h =10,,100, and ) to tune w and h.

41 Tuning the parameters (p and q)
Eleven configuration options (p : q = 0 : 1; 0.1 : 0.9; 0.2 : 0.8; ...; 0.9 : 0.1; 1 : 0) are used to tune p and q.

42 Tuning the parameters (sample color number)
Six configuration options ($|S_s|$ = 5, 10, 15, 20, 25, and 30) are used to tune $|S_s|$.

43 Tuning the parameters (CDF)
Eight configuration options (CDF = 8, 16, 24, 32, 40, 48, 56, and 64) are used to tune CDF.

44 The architecture of the built anti-phishing system

45 Conclusions
• This approach works at the pixel level of Web pages rather than at the text level.
• Experiments show that our method can achieve satisfactory classification precision and phishing recall.
• The time efficiency of computation is also acceptable for online phishing detection.

46 Tentative works
• Continue with more phishing examples and even larger-scale datasets.
• The method cannot detect phishing pages that are not visually similar to the real ones.
• Keep working on developing a client-side application.

47 Thanks for your attention.