Detecting Phishing in Emails
Srikanth Palla, Ram Dantu
University of North Texas, Denton
What is Phishing?
Phishing is a form of online identity theft that employs both social engineering and technical subterfuge. It targets consumers' personal identity data and financial account credentials, such as credit card numbers, account usernames, passwords, and social security numbers. Social-engineering schemes use 'spoofed' emails to lead consumers to counterfeit websites. (Source: Anti-Phishing Working Group, APWG)
Phishing Tactics
Hijacking reputable brand names; creating a plausible premise; redirecting URLs; collecting confidential information through emails.
Do we need to restrict Phishing attacks?
The Statistics… Sources: Anti Phishing Working Group
Problems with Current Spam Filtering Techniques
Current spam filters focus on analyzing the content, but the majority of phishers obfuscate their content to bypass the filters. The filters label an email as BULK and expect the recipient to decide on the authenticity of the source. Current spam filters also have a high degree of false positives.
Methodology
Our method examines the header of the email (not its content), the social network of the recipient, and the credibility of the source. It classifies phishers as prospective phishers, recent phishers, suspects, or serial phishers.
Traffic Profile
The following figure describes the incoming traffic profiles, based on the number of recipients and how often they receive the message.
Corpus Traffic Profile
Our analysis requires the sent folder of the recipient. The emails provided in the TREC evaluation toolkit are spam and non-spam emails, whereas we require a mix of legitimate and phishing emails to evaluate our filter. We have analyzed a live corpus of 13,843 emails, collected over 2.5 years. This corpus has a mix of legitimate, spam, and phishing emails. The different categories of emails are shown in the figure.
Experimental Setup
We deployed our classifier on a recipient's local machine running an IMAP proxy and Thunderbird (the MUA). All of the recipient's emails were fed directly into our classifier by the proxy. Our classifier periodically scans the user's mailbox files for new incoming emails. DNS-based header analysis, social network analysis, and wantedness analysis were performed on each email. The end result is the tagging of each email as Phishing, Opt-out, Socially distinct, or Socially close.
Architecture
The architecture of our classifier consists of three analyses followed by a classification step. Step 1: DNS-based header analysis. Step 2: Social network analysis. Step 3: Wantedness analysis. Step 4: Classification.
Step 1: DNS-based Header Analysis
Stage 1: We validate the information provided in the email header: the hostname position of the sender, the mail server, and the relays in the rest of the path. We divide the entire corpus into two buckets: emails that are valid for DNS lookups (Bucket 1) and emails that are not valid for DNS lookups (Bucket 2).
Stage 2: We perform a DNS lookup on the hostname provided in the Received: lines of the header and match the IP address returned against the IP address stored next to the hostname by the relays during the SMTP authorization process. Bucket 1 is thereby further divided into a trusted bucket and an untrusted bucket.
We pass Bucket 2 and both the trusted and untrusted buckets to the social network analysis phase for further analysis.
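The two stages above can be sketched roughly as follows. This is a minimal illustration, not the classifier used in the experiments. The function names, the regular expression, and the rule that an email goes to Bucket 2 when any Received: line lacks a parseable hostname and IP pair are all assumptions made for the example. The resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import re
import socket
from email import message_from_string

# Assumed pattern: "from <hostname> (... [<ipv4>])" inside a Received: line.
RECEIVED_RE = re.compile(r"from\s+(\S+)\s+\([^)]*\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def default_resolver(hostname):
    """Return the set of IPv4 addresses DNS reports for hostname (empty on failure)."""
    try:
        return set(socket.gethostbyname_ex(hostname)[2])
    except OSError:
        return set()


def parse_hops(raw_email):
    """Extract (hostname, recorded_ip) from each Received: line; None if unparseable."""
    msg = message_from_string(raw_email)
    hops = []
    for line in msg.get_all("Received", []):
        match = RECEIVED_RE.search(line)
        hops.append(match.groups() if match else None)
    return hops


def classify_header(raw_email, resolver=default_resolver):
    """Return 'bucket2', 'trusted', or 'untrusted' for one raw email."""
    hops = parse_hops(raw_email)
    # Stage 1 (assumed rule): no usable hostname/IP pairs means no DNS lookup is possible.
    if not hops or any(hop is None for hop in hops):
        return "bucket2"
    # Stage 2: every recorded IP must be among the addresses DNS returns for its hostname.
    for hostname, recorded_ip in hops:
        if recorded_ip not in resolver(hostname):
            return "untrusted"
    return "trusted"
```

Calling `classify_header(raw)` with the default resolver performs live DNS lookups. Passing a stub such as `lambda host: {"192.0.2.10"}` keeps the check deterministic for testing.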
Step 2: Social Network Analysis
Each of the three buckets received from the DNS-based header analysis (Bucket 2, the untrusted bucket, and the trusted bucket) is treated with rules formulated by analyzing the emails in the recipient's “sent” folder. For instance: all emails from trusted domains will be removed; familiarity with the sender's community; familiarity with the path traversed. The rules can be built according to the recipient's filtering preferences.
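As an illustration of the first rule, the sketch below derives a set of trusted domains from the addresses found in the recipient's sent folder and then separates incoming senders accordingly. It assumes that a "trusted domain" is any domain the recipient has written to. That definition and all function names here are hypothetical and chosen only for this example.

```python
from email.utils import getaddresses, parseaddr


def domain_of(address):
    """Return the lower-cased domain part of an email address, or '' if absent."""
    return address.rpartition("@")[2].lower() if "@" in address else ""


def trusted_domains(sent_recipient_headers):
    """Collect domains from the To/Cc header values of the sent folder."""
    domains = set()
    for _name, addr in getaddresses(sent_recipient_headers):
        domain = domain_of(addr)
        if domain:
            domains.add(domain)
    return domains


def apply_trusted_domain_rule(incoming_from_headers, domains):
    """Split incoming sender addresses into (removed, kept) lists.

    'removed' holds senders from trusted domains, which need no further
    analysis; 'kept' holds the senders passed on to later steps.
    """
    removed, kept = [], []
    for header in incoming_from_headers:
        _name, addr = parseaddr(header)
        (removed if domain_of(addr) in domains else kept).append(addr)
    return removed, kept
```

The other two rules (familiarity with the sender's community and with the path traversed) could be added in the same style, each as a separate predicate that the recipient enables or disables.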
Classification of Trusted and Untrusted Senders
Step 3: Wantedness Analysis
Measuring the sender's credibility (ρ): We believe the credibility of a sender depends on the nature of his recent emails. If the recent emails sent by the sender are legitimate, his credibility increases. If the recent emails from the sender are fraudulent, his fraudulency increases.
Credibility Drops As Time Progresses for Untrusted Senders
Computing Credibility
ΔT(legitimate emails) is the average time period of all legitimate emails w.r.t. the most recent email. ΔT(fraudulent emails) is the average time period of all fraudulent emails w.r.t. the most recent email.
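The two averages defined above can be computed as in the sketch below. It assumes that "the most recent email" means the latest email from the sender in either class, and that the time period is measured in seconds. Both are assumptions made for illustration. How the two values are combined into the credibility ρ is not reproduced here.

```python
from datetime import datetime


def average_age(timestamps, reference):
    """Mean gap in seconds between reference and each timestamp; None if empty."""
    if not timestamps:
        return None
    total = sum((reference - t).total_seconds() for t in timestamps)
    return total / len(timestamps)


def delta_t(legit_times, fraud_times):
    """Return (delta_t_legitimate, delta_t_fraudulent) for one sender.

    Both averages are taken relative to the sender's most recent email,
    whichever class it belongs to (an assumption of this sketch).
    """
    all_times = list(legit_times) + list(fraud_times)
    if not all_times:
        return None, None
    latest = max(all_times)
    return average_age(legit_times, latest), average_age(fraud_times, latest)
```

A sender whose fraudulent emails are all recent yields a small ΔT(fraudulent emails), which matches the intuition that recent fraudulent behaviour should weigh most heavily.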
Credibility of Untrusted Senders
Measuring Recipient's Wantedness
Tolerance (α+) for a sender is higher if the recipient reads and stores his emails for a longer period. Intolerance (β-) for a sender is higher if the recipient deletes his emails without reading them.
Measuring Wantedness
ΔT(legitimate emails) is the average time period of all legitimate emails w.r.t. the most recent email. ΔT(fraudulent emails) is the average time period of all fraudulent emails w.r.t. the most recent email. T_rd is the average storage time period of all read emails. T_urd is the average storage time period of all unread emails.
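The storage-time averages T_rd and T_urd can be sketched as follows. The `StoredEmail` record, the choice of days as the unit, and the convention that an email still in the mailbox is counted up to the current time are assumptions made for this example. The source only defines the two averages.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StoredEmail:
    received: datetime
    read: bool
    deleted: Optional[datetime] = None  # None means the email is still stored


def average_storage_days(emails, read, now):
    """Average storage time in days over emails whose read flag equals `read`.

    Returns None when no email matches. Emails not yet deleted are counted
    up to `now` (an assumption of this sketch).
    """
    durations = [
        ((e.deleted or now) - e.received).total_seconds() / 86400
        for e in emails
        if e.read == read
    ]
    return sum(durations) / len(durations) if durations else None


def storage_times(emails, now):
    """Return (T_rd, T_urd) for one sender's emails."""
    return (
        average_storage_days(emails, True, now),
        average_storage_days(emails, False, now),
    )
```

A large T_rd together with a small T_urd corresponds to the tolerant case described earlier: the recipient reads this sender's emails and keeps them.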
Wantedness of Trusted Senders
Classification
Classification of phishers: credibility vs. phishing frequency. Classification of trusted senders: credibility vs. wantedness.
Classification of Phishers
Classification of Trusted Senders
Summary of Results
[Table: columns are # of emails, False Positives, False Negatives, and Precision (%). Rows for Corpus-I and for Corpus-II are: DNS Analysis; DNS Analysis + Social Network Analysis; DNS Analysis + Social Network Analysis + Wantedness Analysis. The only surviving cell value is "563 (Domains)" in the last Corpus-I row.]
Precision is the percentage of messages classified as phishing that actually are phishing.
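The definition of precision used in the table can be written out as a small helper. The counts in the usage note are hypothetical and are not taken from either corpus.

```python
def precision(true_positives, false_positives):
    """Percentage of messages flagged as phishing that really are phishing."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no messages were classified as phishing")
    return 100.0 * true_positives / flagged
```

For example, a filter that flags 100 messages of which 90 are truly phishing has `precision(90, 10)` equal to 90.0. False negatives do not enter this measure, which is why the table reports them in a separate column.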
Conclusions
Phishers use special software to conceal the path taken by their emails to reach the recipient; most of the time the path length is a single hop. Our classifier can be used in conjunction with any existing spam filtering technique for restricting spam and phishing emails. Rather than labeling an email as BULK, we further classify senders, based on their credibility and wantedness, as prospective phishers, suspects, recent phishers, or serial phishers. We classified two different corpora with a precision of 98.4% and 99.2%, respectively.