Detecting Phishing in Emails
Srikanth Palla, Ram Dantu
University of North Texas, Denton

2 What is Phishing?
- Phishing is a form of online identity theft.
- It employs both social engineering and technical subterfuge.
- It targets consumers' personal identity data and financial account credentials, such as credit card numbers, account usernames, passwords, and social security numbers.
- Social-engineering schemes use 'spoofed' e-mails to lead consumers to counterfeit websites.
Source: Anti-Phishing Working Group (APWG)

3 Phishing Tactics
- Hijacking reputable brand names
- Creating a plausible premise
- Redirecting URLs
- Collecting confidential information through emails

4 Do we need to restrict Phishing attacks?

5 The Statistics… Sources: Anti Phishing Working Group

6 Problems with Current Spam Filtering Techniques
- Current spam filters focus on analyzing the email content.
- The majority of phishers obfuscate their email content to bypass these filters.
- Filters label an email as BULK and expect the recipient to decide on the authenticity of the email source.
- Current spam filters have a high rate of false positives.

7 Methodology
Our method examines:
- The header of the email (not the content)
- The social network of the recipient
- The credibility of the source
It classifies phishers as:
- Prospective phishers
- Recent phishers
- Suspects
- Serial phishers

8 Traffic Profile
The following figure describes the incoming email traffic profiles, based on the number of recipients and how often they receive the message.

9 Email Corpus Traffic Profile
- Our analysis requires the recipient's sent-email folder.
- The emails provided in the TREC evaluation toolkit are spam and non-spam emails.
- We require a mix of legitimate and phishing emails to evaluate our filter.
- We analyzed a live corpus of 13,843 emails collected over 2.5 years. This corpus has a mix of legitimate, spam, and phishing emails. The different categories of emails are shown in the figure.

10 Experimental Setup
- We deployed our classifier on a recipient's local machine running an IMAP proxy and Thunderbird (the MUA).
- The proxy fed all of the recipient's emails directly into our classifier.
- Our classifier periodically scans the user's mailbox files for new incoming emails.
- DNS-based header analysis, social network analysis, and wantedness analysis were performed on each email.
- As the end result, each email is tagged as Phishing, Opt-out, Socially distinct, or Socially close.

11 Architecture
The architecture model of our classifier consists of three analyses followed by a classification step:
- Step 1: DNS-based header analysis
- Step 2: Social network analysis
- Step 3: Wantedness analysis
- Step 4: Classification

12 Step 1: DNS-based Header Analysis
Stage 1: We validate the information provided in the email header: the hostname position of the sender, the mail server, and the relays in the rest of the path. We divide the entire corpus into two buckets:
- Emails that are valid for DNS lookups (Bucket 1)
- Emails that are not valid for DNS lookups (Bucket 2)
Stage 2: We perform a DNS lookup on the hostname given in each Received: line of the header and compare the IP address returned with the IP address that the relays stored next to the hostname during the SMTP authorization process. Bucket 1 is further divided into:
- Trusted bucket
- Untrusted bucket
We pass Bucket 2 and both the trusted and untrusted buckets to the Social Network Analysis phase for further analysis.
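
A rough sketch of the Stage 2 comparison for a single Received: line follows. The regular expression, the names `RECEIVED_RE`, `resolve_ipv4`, and `check_received_line`, and the mapping of outcomes to buckets (unparsable or unresolvable goes to Bucket 2, a matching IP is trusted, a mismatch is untrusted) are assumptions made for illustration; the slides do not give these details.

```python
import re
import socket

# Assumed shape: "from <hostname> (... [<ipv4>]) by ...".
# Real Received: lines vary widely, so this pattern is only illustrative.
RECEIVED_RE = re.compile(
    r"from\s+(?P<host>[A-Za-z0-9.-]+)\s.*?\[(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\]",
    re.S,
)


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses for hostname, or an empty set on failure."""
    try:
        return set(socket.gethostbyname_ex(hostname)[2])
    except (OSError, UnicodeError):
        return set()


def check_received_line(line, resolver=resolve_ipv4):
    """Classify one Received: line as 'bucket2', 'trusted', or 'untrusted'."""
    match = RECEIVED_RE.search(line)
    if match is None:
        return "bucket2"  # no hostname/IP pair to look up
    addresses = resolver(match.group("host"))
    if not addresses:
        return "bucket2"  # hostname does not resolve
    # Compare the DNS answer with the IP recorded next to the hostname.
    return "trusted" if match.group("ip") in addresses else "untrusted"
```

Passing the resolver as a parameter keeps the comparison logic testable without network access; a deployment would use the default `resolve_ipv4`.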

13 Step 2: Social Network Analysis
Each of the three buckets received from the DNS-based header analysis (Bucket 2, the untrusted bucket, and the trusted bucket) is treated with rules formulated by analyzing the emails in the recipient's "sent" folder. For instance:
- All emails from trusted domains are removed
- Familiarity with the sender's community
- Familiarity with the path traversed
The rules can be built according to the recipient's email filtering preferences.

14 Classification of Trusted and Untrusted Senders

15 Step 3: Wantedness Analysis
Measuring the sender's credibility (ρ):
- We believe the credibility of a sender depends on the nature of the sender's recent emails.
- If the recent emails from the sender are legitimate, the sender's credibility increases.
- If the recent emails from the sender are fraudulent, the sender's fraudulency increases.

16 Credibility Drops As Time Progresses for Untrusted Senders

17 Computing Credibility
- $\Delta T_{\text{legitimate emails}}$ is the average time period of all legitimate emails with respect to the most recent email.
- $\Delta T_{\text{fraudulent emails}}$ is the average time period of all fraudulent emails with respect to the most recent email.
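
The two averages can be sketched as follows. The sketch assumes that "the most recent email" means the latest email from that sender, legitimate or fraudulent, and it covers only the two ΔT terms, not the credibility formula itself. The function names are hypothetical.

```python
from datetime import datetime, timedelta


def average_age(timestamps, most_recent):
    """Mean of (most_recent - t) over timestamps, or None if there are none."""
    if not timestamps:
        return None
    total = sum((most_recent - t for t in timestamps), timedelta())
    return total / len(timestamps)


def delta_t(legitimate, fraudulent):
    """Return (delta_t_legitimate, delta_t_fraudulent) for one sender.

    legitimate and fraudulent are lists of datetime arrival times.
    """
    all_times = legitimate + fraudulent
    if not all_times:
        return None, None
    most_recent = max(all_times)  # assumed reference point
    return (
        average_age(legitimate, most_recent),
        average_age(fraudulent, most_recent),
    )
```

A small ΔT for fraudulent emails means the sender's fraudulent activity is recent, which matches the idea that credibility depends on the nature of the recent emails.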

18 Credibility of Untrusted Senders

19 Measuring Recipient's Wantedness
- Tolerance ($\alpha^{+}$) for a sender is higher if the recipient reads the sender's emails and stores them for a longer period.
- Intolerance ($\beta^{-}$) for a sender is higher if the recipient deletes the sender's emails without reading them.

20 Measuring Wantedness
- $\Delta T_{\text{legitimate emails}}$ is the average time period of all legitimate emails with respect to the most recent email.
- $\Delta T_{\text{fraudulent emails}}$ is the average time period of all fraudulent emails with respect to the most recent email.
- $T_{rd}$ is the average storage time period of all read emails.
- $T_{urd}$ is the average storage time period of all unread emails.
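
The storage-time averages for read and unread emails might be computed as in the sketch below. It assumes that storage time runs from receipt until deletion, or until the time of analysis for emails still in the mailbox; the `StoredEmail` record and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class StoredEmail:
    received: datetime
    removed: Optional[datetime]  # None if the email is still in the mailbox
    read: bool


def average_storage_time(emails, read, now):
    """Average storage time of the read (read=True) or unread (read=False) emails.

    Returns None when no email matches the requested read state.
    """
    durations = [
        (e.removed or now) - e.received for e in emails if e.read == read
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Calling the function with `read=True` gives the read-email average and with `read=False` the unread-email average for one sender.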

21 Wantedness of Trusted Senders

22 Classification
- Classification of phishers: credibility vs. phishing frequency
- Classification of trusted senders: credibility vs. wantedness

23 Classification of Phishers

24 Classification of Trusted Senders

25 Summary of Results

Analysis | # of emails, False Positives, False Negatives | Precision

Corpus-I
DNS Analysis | 119682600 | 85%
DNS Analysis + Social Network Analysis | 25480305 | 95.6%
DNS Analysis + Social Network Analysis + Wantedness Analysis | 563 (Domains)0301 | 98.4%

Corpus-II
DNS Analysis | 75650 | 90.4%
DNS Analysis + Social Network Analysis | 5900 | 93.75%
DNS Analysis + Social Network Analysis + Wantedness Analysis | 14810 | 99.2%

Precision is the percentage of messages classified as phishing that actually are phishing.
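
The precision definition can be written out as a short function. The counts in the usage note are made up for illustration and are not taken from the results table; the function name is hypothetical.

```python
def precision_percent(true_positives, false_positives):
    """Percentage of messages classified as phishing that actually are phishing.

    Returns None when nothing was classified as phishing.
    """
    flagged = true_positives + false_positives
    if flagged == 0:
        return None
    return 100 * true_positives / flagged
```

For example, if a filter flags 100 messages and 90 of them really are phishing, `precision_percent(90, 10)` gives 90.0.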

26 Conclusions
- Phishers use special software to conceal the path their emails take to reach the recipient. Most of the time, the path is a single hop.
- Our classifier can be used in conjunction with any existing spam filtering technique to restrict spam and phishing emails.
- Rather than labeling an email as BULK, we further classify senders, based on their credibility and wantedness, as:
  - Prospective phishers
  - Suspects
  - Recent phishers
  - Serial phishers
- We classified two different email corpora with a precision of 98.4% and 99.2%, respectively.

