Analyzing Behavioral Features for Email Classification


1 Analyzing Behavioral Features for Email Classification
:- Virat Agarwal

2 Analyzing Behavioral Features for Email Classification
Steve Martin, Anil Sewani, Blaine Nelson, Karl Chen, Anthony D. Joseph
University of California, Berkeley
Conference on Email and Anti-Spam (CEAS 2005)

3 Motivation
Email is one of the most ubiquitous methods of communication.
>60 billion messages daily (International Data Group)
>50% unsolicited mail, spam (Message Labs)
7/10 computer worms spread via email (Sophos Corporation)
Phishing: a growing concern

4 Outline Background Analyzing Behavioral Features Application/Results
Conclusion

5 What is Phishing?
Listening to music by the band called Phish?
A hobby, sport or recreation involving the ocean or a river?
NO!!!
"Fishing for personal information": using "spoofed" emails and fraudulent websites designed to fool recipients into divulging personal and financial data.

6 Example of Phishing Email
From: Customer Support …
Sent: Thursday, October 07 …
To: …
Subject: NOTE! Citibank account suspend in process
Dear Customer:
Recently there have been a large number of cyber attacks … we require you to sign on immediately. … This process is mandatory, … your account may be subject to temporary suspension. Please make sure you have your Citibank(R) debit card number and your User ID and Password at hand. … please click the link below:
Regards, Citibank(R) Card Department
(C)2004 Citibank. …

7 What is spam?
Unsolicited Commercial Email (UCE), also known as "spam" or "junk email": email that is "unwanted, inappropriate and no longer wanted…"
Why is classifying mail as spam difficult? Consider business development mails:
1. They are targeted to specific people.
2. They might arouse interest in some recipients, and so matter to them.
3. Each email is personalized.
4. Some recipients still call it spam.

8

9 Why is stopping this so difficult?
SMTP (Simple Mail Transfer Protocol)
SMTP is not secure: email can be altered en route ⇒ source info lost!
This is exploited by spammers.

10 Current Methods … Current methods examine incoming mail
Spam detectors: statistical features for classification.
Virus scanners: signature comparison.
How about statistical analysis of outgoing mail?
Spam detectors work by statistically analysing a user's incoming mail and building a profile of that user. Virus scanners have virus definition files that contain signatures, or hash values, of known viruses; if the hash value the scanner calculates for an incoming mail matches a human-generated signature, the mail is considered infected. A newly generated virus, however, may easily escape this test. One problem with classifying incoming mail is that it is composed of emails from many different users and might be contaminated with spam and viruses, so profiling a user based on incoming mail is probably not right. Outgoing mail can be used to profile a user's normal behaviour, so that abnormal activity from spambots and viruses can be detected and contained at the source.
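A minimal sketch of the signature comparison described above, to show why a new variant slips through. The payload bytes, the SIGNATURES set and the function name is_known_virus are made up for illustration; real scanners use richer signatures than a whole-file hash.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known malicious payloads.
KNOWN_PAYLOADS = [b"worm-payload-v1", b"worm-payload-v2"]
SIGNATURES = {hashlib.sha256(p).hexdigest() for p in KNOWN_PAYLOADS}

def is_known_virus(attachment: bytes) -> bool:
    """Return True if the attachment's digest matches a stored signature."""
    return hashlib.sha256(attachment).hexdigest() in SIGNATURES

# A payload already in the database is flagged.
print(is_known_virus(b"worm-payload-v1"))
# A new variant has a different digest, so it passes until the database is updated.
print(is_known_virus(b"worm-payload-v3"))
```

Any change to the payload changes the digest, which is the gap that behavioural profiling of outgoing mail is meant to cover.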

11 Outline Background Analyzing Behavioral Features Application/Results
Conclusion

12 What is a Feature? Definition
A set of features that distinguishes between normal and abnormal activity.
A feature can take a continuous or a multinomial value.
A feature is a statistic that measures some aspect of a user's activity or behaviour. For example, a frequency calculation returns a number, and the types of attachments return a bit array.
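The two kinds of feature can be sketched as follows. The message records, the field names and the MIME_TYPES list are assumptions made for this example and are not taken from the paper.

```python
# Hypothetical outgoing-message records; field names are illustrative only.
messages = [
    {"recipients": ["a@example.com", "b@example.com"], "attachments": []},
    {"recipients": ["a@example.com"], "attachments": ["application/pdf"]},
    {"recipients": ["c@example.com"], "attachments": ["image/png", "application/zip"]},
    {"recipients": ["b@example.com"], "attachments": []},
]

# Assumed fixed ordering of top-level MIME types for the bit array.
MIME_TYPES = ["application", "image", "text"]

def attachment_ratio(msgs):
    """Continuous feature: fraction of messages with at least one attachment."""
    return sum(1 for m in msgs if m["attachments"]) / len(msgs)

def attachment_type_bits(msg):
    """Bit-array feature: one bit per top-level MIME type present in the message."""
    present = {a.split("/")[0] for a in msg["attachments"]}
    return [1 if t in present else 0 for t in MIME_TYPES]

ratio = attachment_ratio(messages)        # 2 of the 4 messages carry attachments
bits = attachment_type_bits(messages[2])  # application and image present, text absent
print(ratio, bits)
```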

13 Behavioral Analysis of Outgoing Email Feature Histogram
Comparison of two users' characteristics based on the number of distinct addresses email is sent to over a window of messages.
Source: Enron Dataset 148 ⇔126, users mails
Relative frequency is on the y axis and the number of distinct addresses on the x axis. The second graph is the histogram of the absolute difference between the two users' histograms. By summing up its values we can get a rough estimate of the similarity between two users based on this metric.
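A rough sketch of this comparison, assuming non-overlapping windows and toy recipient lists (the slide does not say how the windows are formed). Because each relative-frequency histogram sums to 1, the summed absolute difference lies between 0 (identical) and 2 (no overlap).

```python
from collections import Counter

def distinct_recipient_counts(recipient_lists, window):
    """Number of distinct recipient addresses in each consecutive window of messages."""
    counts = []
    for start in range(0, len(recipient_lists) - window + 1, window):
        seen = set()
        for recipients in recipient_lists[start:start + window]:
            seen.update(recipients)
        counts.append(len(seen))
    return counts

def relative_histogram(values):
    """Map each observed value to its relative frequency."""
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}

def histogram_distance(h1, h2):
    """Sum of absolute differences between two relative-frequency histograms."""
    return sum(abs(h1.get(k, 0.0) - h2.get(k, 0.0)) for k in set(h1) | set(h2))

# Toy data: one recipient list per outgoing message.
user_a = [["x"], ["y"], ["x"], ["x"]]
user_b = [["p"], ["p"], ["q"], ["q"]]

hist_a = relative_histogram(distinct_recipient_counts(user_a, window=2))
hist_b = relative_histogram(distinct_recipient_counts(user_b, window=2))
print(histogram_distance(hist_a, hist_b))
```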

14 Feature Histogram (contd …)
Note that summing up the values in the previous histogram gives a similarity value between 0 and 2. We do this for each user and plot a frequency histogram. Here we notice that:
1. Behavioral features are different per user.
2. Certain features vary more widely than others.

15 Analyzing relevant features
NP-complete ⇒ approximation algorithms
PCA does not lead to good classification.
Choosing the subset of features that optimally predicts the desired function is NP-complete. PCA, based on the covariance of the data, gives the directions of maximum variance in the dataset, so for each user we get a signature (of features). However, PCA gives a linear combination of features and thus does not give the separate components from the individual data set.
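To illustrate why PCA yields combinations rather than individual features, here is a small sketch that finds the first principal component by power iteration on the covariance matrix. The data rows are invented, and power iteration is just one simple way to get the leading eigenvector; it is not claimed to be what the authors used.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equal-length feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1)
             for b in range(d)] for a in range(d)]

def first_principal_component(cov, iterations=200):
    """Leading eigenvector of cov by power iteration, normalised to unit length."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(c * c for c in w) ** 0.5
        v = [c / norm for c in w]
    return v

# Invented data: feature 1 is exactly twice feature 0, feature 2 is unrelated.
rows = [[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1]]
pc = first_principal_component(covariance(rows))
print(pc)
```

The resulting direction has non-zero weight on more than one feature, so it cannot be read directly as "keep feature k", which is the limitation the slide points to.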

16 This helps in clustering the users based on feature similarities.

17 Finally … Feature Selection
Feature Ranking
Rank features based on a greedy approach.
Greedy approach: not optimal.
Ratio of emails with attachments
Binary attachment
5. Frequency of MIME type application
6. No. of attachments
Magic type application
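A generic greedy ranking can be sketched as below. The slide does not give the scoring criterion, so toy_score, the RELEVANCE numbers and the REDUNDANCY penalty are all invented for illustration; only the greedy loop itself reflects the approach named on the slide.

```python
def greedy_rank(features, score):
    """Rank features by repeatedly adding the one that gives the best score(subset)."""
    selected = []
    remaining = list(features)
    while remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Invented numbers: how useful each feature is on its own.
RELEVANCE = {
    "ratio_with_attachments": 0.9,
    "num_attachments": 0.8,
    "binary_attachment": 0.5,
}
# Invented penalty for a pair of features that carry overlapping information.
REDUNDANCY = {frozenset({"ratio_with_attachments", "num_attachments"}): 0.6}

def toy_score(subset):
    """Hypothetical subset score: summed relevance minus redundancy penalties."""
    chosen = set(subset)
    gain = sum(RELEVANCE[f] for f in subset)
    penalty = sum(p for pair, p in REDUNDANCY.items() if pair <= chosen)
    return gain - penalty

ranking = greedy_rank(list(RELEVANCE), toy_score)
print(ranking)
```

Each step commits to the locally best feature and never revisits earlier choices, which is why the result is fast to compute but not guaranteed optimal.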

18 Outline Background Analyzing Behavioral Features Application/Results
Conclusion

19 Novel Worm Detection
New worms can cause widespread infection before antivirus scanners are updated.
This approach chokes off avenues of infection.

20 Results
Setup for analysis:
VMware virtual machines infected to capture worm messages.
Artificial training sets created by injecting worm messages (2 worms).
Testing sets created by injecting worm messages (3 worms, some same, some different).
Two different classification models used: SVMs and Naïve Bayes classification.
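The setup above can be sketched end to end on toy data: worm feature vectors are injected into normal traffic to form a labelled training set, and a Gaussian Naïve Bayes model is fitted to it. The two features (attachment ratio, messages per hour), all the numbers and the Gaussian form of the model are assumptions for this example; the slides do not specify them, and the SVM model is not shown.

```python
import math
import random

def inject(normal, worm, seed=0):
    """Mix labelled normal (0) and worm (1) feature vectors into one shuffled set."""
    data = [(x, 0) for x in normal] + [(x, 1) for x in worm]
    random.Random(seed).shuffle(data)
    return data

def fit_gaussian_nb(data):
    """Per class: prior, per-feature mean and variance (with a small floor)."""
    model = {}
    for label in {y for _, y in data}:
        rows = [x for x, y in data if y == label]
        d = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        variances = [max(sum((r[j] - means[j]) ** 2 for r in rows) / len(rows), 1e-3)
                     for j in range(d)]
        model[label] = (len(rows) / len(data), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for xj, m, v in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        return total
    return max(model, key=log_posterior)

# Invented feature vectors: [attachment ratio, messages per hour].
normal = [[0.1, 2.0], [0.2, 3.0], [0.0, 1.0], [0.15, 2.5]]
worm = [[1.0, 40.0], [0.9, 55.0], [1.0, 60.0]]

model = fit_gaussian_nb(inject(normal, worm))
print(predict(model, [0.95, 50.0]), predict(model, [0.1, 2.0]))
```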

21 Results (contd…)

22 Outline Background Analyzing Behavioral Features Application/Results
Conclusion

23 Conclusion
Feature generation on outgoing email traffic.
User behavior can be clustered into sets.
Large-scale detection system possible.
Feature selection necessary to prevent overfitting.
Abnormal behavior detected with an accuracy of 0.9.

24 Thank You! Questions?

