Lexical Feature Based Phishing URL Detection Using Online Learning. Reporter: Jing Chiu. Advisor: Yuh-Jye Lee. 2011/3/17.


1 Lexical Feature Based Phishing URL Detection Using Online Learning
Reporter: Jing Chiu
Advisor: Yuh-Jye Lee
Email: D9815013@mail.ntust.edu.tw
2011/3/17, Data Mining and Machine Learning Lab.

2 Paper Information
- Authors:
  - Aaron Blum (University of Alabama at Birmingham)
  - Brad Wardman (University of Alabama at Birmingham)
  - Thamar Solorio (University of Alabama at Birmingham)
- Source:
  - 3rd ACM Workshop on Artificial Intelligence and Security (AISec), 2010

3 Outline
- Introduction
- Related Work
- Approach
- Data
- Evaluation
- Conclusion

4 Introduction
- Phishing
  - A cybercrime carried out through spam emails and fraudulent websites
  - Entices victims to provide sensitive information
  - The information is used to steal identities or gain access to money
- Characteristics
  - Highly dynamic environment
  - Models need to be updated frequently
- New ideas
  - Combine online learning with a content-inspection-based approach
  - Train the model largely on lexical features (without host-based features)
  - Show that URL-inspection-based detection performs as well as content-inspection-based detection

5 Related Work
- Content-based phishing URL detection
  - Uses the similarity between content files to detect phishing websites
- Purely URL-based malicious URL detection
  - Uses host information and URL lexical features with online learning algorithms
- PhishNet
  - Extends the usability of blacklists
- Domain blacklisting
  - Expands blacklists using DNS zone file data and WHOIS information

6 Approach
- Feature extraction
  - Delimiters: "/", "?", ".", "=" and "_"
  - Bigram combination
  - Lexical feature groups
- Learning algorithm
  - Confidence-Weighted (CW) algorithm
  - Updates the model using different weights based on how often each feature occurs
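The feature extraction step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "bigram" means a pair of adjacent tokens (the slide does not say whether bigrams are token-level or character-level), and the function names and the `uni:`/`bi:` feature prefixes are hypothetical.

```python
import re
from collections import Counter

# Delimiters listed on the slide: "/", "?", ".", "=" and "_".
DELIMITERS = re.compile(r"[/?.=_]")


def tokenize(url: str) -> list[str]:
    """Split a URL on the slide's delimiters and drop empty pieces."""
    return [tok for tok in DELIMITERS.split(url) if tok]


def lexical_features(url: str) -> Counter:
    """Count unigram and adjacent-token bigram features.

    Assumption: bigrams are formed from neighbouring tokens.
    The "uni:" / "bi:" prefixes are illustrative naming only.
    """
    tokens = tokenize(url)
    feats = Counter(f"uni:{t}" for t in tokens)
    feats.update(f"bi:{a}|{b}" for a, b in zip(tokens, tokens[1:]))
    return feats


if __name__ == "__main__":
    for name, count in lexical_features("http://example.com/login.php?id=1").items():
        print(name, count)
```

Note that ":" is not in the delimiter list, so the scheme survives as the token `http:`. The resulting sparse counts are the kind of input an online linear classifier can consume directly.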

7 Approach (cont.)
- MD5 matching
  - Uses files' MD5 checksums to check file similarity
  - Easy to evade (by varying the content)
  - Examples
- Deep MD5 matching
  - Downloads all associated content files
  - Compares the similarity between two websites' content files using the Kulczynski 2 coefficient
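A rough sketch of the Deep MD5 comparison, assuming each website is represented by the set of MD5 digests of its downloaded content files. The Kulczynski 2 coefficient for two sets A and B is 0.5 * (|A ∩ B| / |A| + |A ∩ B| / |B|). The function names are hypothetical, and returning 0.0 for an empty site is a choice made here, not something the slides specify.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """MD5 digest of one content file, as a hex string."""
    return hashlib.md5(data).hexdigest()


def kulczynski2(files_a: list[bytes], files_b: list[bytes]) -> float:
    """Kulczynski 2 similarity between two sites' sets of file digests."""
    set_a = {md5_hex(f) for f in files_a}
    set_b = {md5_hex(f) for f in files_b}
    if not set_a or not set_b:
        return 0.0  # assumption: an empty site matches nothing
    shared = len(set_a & set_b)
    # Average of the two "fraction of my files also found on the other site".
    return 0.5 * (shared / len(set_a) + shared / len(set_b))


if __name__ == "__main__":
    site_a = [b"logo", b"style", b"form"]
    site_b = [b"logo", b"style", b"other", b"extra"]
    print(kulczynski2(site_a, site_b))
```

Because the score is computed over the whole file set, changing a single file lowers the similarity only slightly, which is what makes this harder to evade than a single-file MD5 match.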

8 Data
- Data sources
  - UAB Phishing Data Mine
    - Collected over two and a half years
    - Benign URLs may look "phishy"
    - 9,506 unique domains
    - 25,203 URLs (6,114 malicious)
  - Cyveillance
    - 18,990 unique domains
    - 34,234 URLs (all malicious)
  - All feeds are fully de-duplicated
- Datasets
  - UAB feeds
  - Cyveillance full
  - Cyveillance abridged
  - Mixed

9 Data (cont.)
- Figure: Percentage of total URLs vs. individual domains
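The figure itself is not recoverable from the transcript. One plausible reading is a curve showing how much of the URL total is covered by the most frequent domains. The sketch below computes such a curve under that assumption. It uses the URL's hostname as a stand-in for "domain", which is a simplification, and the function name is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def cumulative_url_share(urls: list[str]) -> list[float]:
    """Cumulative percentage of URLs covered by the top-k hosts, k = 1..n.

    Assumption: the hostname approximates the "domain" on the slide.
    """
    counts = Counter(urlsplit(u).hostname or "" for u in urls)
    total = sum(counts.values())
    shares: list[float] = []
    running = 0
    for _, n in counts.most_common():  # most frequent hosts first
        running += n
        shares.append(100.0 * running / total)
    return shares


if __name__ == "__main__":
    sample = [
        "http://a.example/1",
        "http://a.example/2",
        "http://a.example/3",
        "http://b.example/1",
    ]
    print(cumulative_url_share(sample))
```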

10 Evaluation
- Experiment setting
  - Training and testing were conducted on daily batches
  - Training was initially conducted on UAB data
  - The model is updated from a daily URL blacklist/whitelist feed
  - False positive and false negative error rates were computed for every prediction
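The daily protocol above can be sketched as a predict-then-update loop. Two assumptions are made here and should be read as such: each day's batch is scored before the model is updated on it, and a plain perceptron is used as a stand-in online learner. It is not the Confidence-Weighted algorithm from the paper, and all class and function names are hypothetical.

```python
class OnlinePerceptron:
    """Stand-in online linear learner (NOT the Confidence-Weighted algorithm)."""

    def __init__(self) -> None:
        self.w: dict[str, float] = {}

    def predict(self, feats: dict[str, float]) -> int:
        score = sum(self.w.get(f, 0.0) * v for f, v in feats.items())
        return 1 if score > 0 else -1  # +1 = phishing, -1 = benign

    def update(self, feats: dict[str, float], label: int) -> None:
        # Mistake-driven update: adjust weights only on a wrong prediction.
        if self.predict(feats) != label:
            for f, v in feats.items():
                self.w[f] = self.w.get(f, 0.0) + label * v


def evaluate_daily(model, days):
    """Return one (false_positive_rate, false_negative_rate) pair per day.

    days: list of batches; each batch is a list of (feats, label) pairs.
    Each batch is scored first, then used to update the model.
    """
    results = []
    for batch in days:
        fp = fn = pos = neg = 0
        for feats, label in batch:
            pred = model.predict(feats)
            if label == 1:
                pos += 1
                fn += pred != 1
            else:
                neg += 1
                fp += pred == 1
        results.append((fp / neg if neg else 0.0, fn / pos if pos else 0.0))
        for feats, label in batch:
            model.update(feats, label)
    return results


if __name__ == "__main__":
    day = [({"uni:login": 1.0}, 1), ({"uni:docs": 1.0}, -1)]
    print(evaluate_daily(OnlinePerceptron(), [day, day]))
```

Scoring before updating keeps each day's error rates an honest estimate on unseen URLs, which matters in a setting where phishing URLs change quickly.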

11 Evaluation (cont.)

12 Evaluation (cont.)

13 Evaluation (cont.)

14 Conclusion
- Lexical-feature-based learning with the CW algorithm provides robust performance
- High-quality, diverse training data can achieve accuracy above 97%
- For the proposed system
  - Training data can be collected from any blacklist
  - Easy to implement, with robust performance

15 Thanks for your attention
- Q&A

16 Lexical Feature Group

17 URLs including the recipient's email

18 Data in UAB Phishing Data Mine

