1 PhishNet: Predictive Blacklisting to Detect Phishing Attacks
Reporter: Gia-Nan Gao
Advisor: Chin-Laung Lei
2010/4/26
2 Reference
Pawan Prakash, Manish Kumar, Ramana Rao Kompella, and Minaxi Gupta, “PhishNet: Predictive Blacklisting to Detect Phishing Attacks,” in IEEE INFOCOM 2010.
3 Outline
Introduction
Two Major Components of PhishNet
  ◦ URL prediction component
  ◦ Approximate URL matching component
Evaluation
Conclusion
4 Introduction
Phishing attacks
  ◦ Set up fake web sites that mimic real businesses in order to lure users into revealing sensitive information
Blacklisting
  ◦ Match a given URL against a list of known malicious URLs (the blacklist)
Problem of blacklisting
  ◦ A malicious URL cannot be blacklisted until it has become sufficiently prevalent in the wild
5 Two Major Components of PhishNet
URL prediction component
  ◦ Generate new URLs (children) from known phishing URLs (parents) by applying various heuristics
  ◦ Test whether the generated URLs are indeed malicious
Approximate URL matching component
  ◦ Perform an approximate match of a new URL against the existing blacklist
6 Component 1: Heuristics for Generating New URLs
Typical blacklist URL structure
  ◦ string
H1: Replacing TLDs
H2: IP address equivalence
H3: Directory structure similarity
H4: Query string substitution
H5: Brand name equivalence
7 Heuristics for Generating New URLs
H1: Replacing TLDs
  ◦ 3,210 effective top-level domains (TLDs)
  ◦ Replace the effective TLD of the parent URL with each of the 3,209 other effective TLDs
H2: IP address equivalence
  ◦ Phishing URLs that have the same IP address are grouped together into clusters
  ◦ Create new URLs by considering all combinations of hostnames and pathnames within a cluster
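H1 and H2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four-entry TLD list, the `example` hostnames, the function names, and the fixed `http` scheme in `combine_cluster` are all assumptions made for the example.

```python
from itertools import product
from urllib.parse import urlsplit, urlunsplit

# Tiny illustrative list (assumption); the slides mention 3,210 effective TLDs.
EFFECTIVE_TLDS = ["com", "net", "org", "co.uk"]


def replace_tld(url, tlds=EFFECTIVE_TLDS):
    """H1: swap the parent's effective TLD for every other TLD in the list."""
    parts = urlsplit(url)
    host = parts.netloc
    # Pick the longest listed TLD that the hostname ends with.
    current = max((t for t in tlds if host.endswith("." + t)), key=len, default=None)
    if current is None:
        return []
    stem = host[: -len(current)]  # keeps the trailing dot
    return [
        urlunsplit((parts.scheme, stem + t, parts.path, parts.query, parts.fragment))
        for t in tlds
        if t != current
    ]


def combine_cluster(urls):
    """H2: for URLs that share an IP, pair every hostname with every pathname."""
    split = [urlsplit(u) for u in urls]
    hosts = sorted({s.netloc for s in split})
    paths = sorted({s.path for s in split})
    known = set(urls)
    # Assumes plain http for the generated children.
    children = [f"http://{h}{p}" for h, p in product(hosts, paths)]
    return [c for c in children if c not in known]
```

`combine_cluster` expects the caller to have already grouped URLs by resolved IP address; it only drops combinations that are already in the input.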
8 Heuristics for Generating New URLs (cont’d)
H3: Directory structure similarity
  ◦ URLs with similar directory structure are grouped together
  ◦ Build new URLs by exchanging the filenames among URLs belonging to the same group
  ◦ Parent
  ◦ Child
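A minimal sketch of H3, assuming that "similar directory structure" means an identical directory path (the slides do not define the similarity test). The function name, the `example` hostnames, and the fixed `http` scheme are hypothetical.

```python
from collections import defaultdict
from posixpath import split as path_split
from urllib.parse import urlsplit


def exchange_filenames(urls):
    """H3: within each directory group, give every host every filename."""
    groups = defaultdict(list)
    for u in urls:
        parts = urlsplit(u)
        directory, filename = path_split(parts.path)
        # Normalize so the root directory "/" does not produce "//" later.
        groups[directory.rstrip("/")].append((parts.netloc, filename))

    known = set(urls)
    children = set()
    for directory, members in groups.items():
        filenames = {f for _, f in members}
        for host, _ in members:
            for f in filenames:
                child = f"http://{host}{directory}/{f}"
                if child not in known:
                    children.add(child)
    return sorted(children)
```

With two parents in `/online/signin` on different hosts, each host receives the other's filename as a child URL.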
9 Heuristics for Generating New URLs (cont’d)
H4: Query string substitution
  ◦ Build new URLs by exchanging the query strings among URLs
  ◦ Parent
  ◦ Child
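A minimal sketch of H4, assuming the caller passes in the set of URLs whose query strings should be exchanged (the slides do not say how that set is chosen). The function name and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def exchange_queries(urls):
    """H4: attach every observed query string to every URL's host and path."""
    split = [urlsplit(u) for u in urls]
    queries = sorted({s.query for s in split if s.query})
    known = set(urls)
    children = set()
    for s in split:
        for q in queries:
            child = urlunsplit((s.scheme, s.netloc, s.path, q, ""))
            if child not in known:
                children.add(child)
    return sorted(children)
```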
10 Heuristics for Generating New URLs (cont’d)
H5: Brand name equivalence
  ◦ Build new URLs by substituting brand names occurring in phishing URLs with other brand names
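A minimal sketch of H5. The brand list below is a made-up placeholder, as are the function name and sample URL; a real deployment would need a curated list of targeted brands.

```python
import re

# Placeholder brand names (assumption), not taken from the slides.
BRANDS = ["alphabank", "betapay", "gammashop"]


def substitute_brands(url, brands=BRANDS):
    """H5: replace each brand found in the URL with every other known brand."""
    children = []
    lowered = url.lower()
    for found in brands:
        if found not in lowered:
            continue
        pattern = re.compile(re.escape(found), re.IGNORECASE)
        for other in brands:
            if other != found:
                children.append(pattern.sub(other, url))
    return children
```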
11 Component 1: Verification
Conduct a DNS lookup to filter out sites that cannot be resolved
For each resolved URL
  ◦ Try to establish a connection to the corresponding server
For each successful connection
  ◦ Issue an HTTP GET request to obtain content from the server
If the HTTP response from the server has status code 200/202 (successful request)
  ◦ Compute the content similarity between the parent and child URLs
If the child URL’s content closely resembles the parent’s (say, above 90%)
  ◦ Conclude that the child URL is a bad site
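The verification steps above can be sketched as a short pipeline. This is an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for whatever content-similarity measure the authors used, and `resolve` and `fetch` are hypothetical callables supplied by the caller (for instance, thin wrappers around `socket.gethostbyname` and `urllib.request.urlopen`), so the logic can be exercised without network access.

```python
from difflib import SequenceMatcher


def content_similarity(a, b):
    """Return a similarity ratio in [0, 1] between two page bodies."""
    return SequenceMatcher(None, a, b).ratio()


def is_phishing_child(parent_body, child_url, resolve, fetch, threshold=0.90):
    """Apply the DNS, connection, status-code, and similarity checks in order.

    resolve(url) -> bool: True if the hostname resolves.
    fetch(url)   -> (status, body), or None if the connection failed.
    """
    if not resolve(child_url):
        return False
    response = fetch(child_url)
    if response is None:
        return False
    status, body = response
    if status not in (200, 202):
        return False
    return content_similarity(parent_body, body) > threshold
```

Passing the network operations in as functions keeps the decision logic separate from I/O, which also makes it easy to add timeouts or rate limiting in the wrappers.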
12 Component 2: Approximate Matching
Determine whether a given URL is a phishing site or not
13 M1: Matching IP Address
Perform a direct match of the URL’s IP address against the IP addresses of the blacklist entries
Assign a normalized score based on the number of blacklist entries that map to a given IP address
IP address IP_i is common to n_i blacklisted URLs
min{n_i} (max{n_i}): the minimum (maximum) number of phishing URLs hosted on any single blacklisted IP address
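The scoring formula itself does not survive in these slides. One natural reading of "normalized" together with min{n_i} and max{n_i} is min-max normalization, (n_i - min{n_i}) / (max{n_i} - min{n_i}); the sketch below assumes exactly that. The function names and the documentation-range IP addresses are hypothetical.

```python
from collections import Counter


def ip_scores(blacklist_ips):
    """Map each blacklisted IP to a min-max normalized score (assumed formula)."""
    counts = Counter(blacklist_ips)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # All IPs host the same number of URLs; treat them as equally suspicious.
        return {ip: 1.0 for ip in counts}
    return {ip: (n - lo) / (hi - lo) for ip, n in counts.items()}


def m1_score(ip, scores):
    """Score an incoming URL's IP; unseen IPs score 0."""
    return scores.get(ip, 0.0)
```

Under this assumption, an IP that hosts the minimum number of blacklisted URLs scores 0, the same as an IP that is not blacklisted at all.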
14 M2: Matching Hostname
Perform a hostname match against those in the blacklist
Domains of phishing URLs
  ◦ Specifically registered for hosting phishing sites
  ◦ Hosted on free/paid-for web-hosting services (WHS)
Identify whether an incoming URL consists of a WHS or not
  ◦ Matching WHSes
  ◦ Matching non-WHSes
15 M2: Matching Hostname (cont’d)
16 M3: Matching Directory Structure
Perform a directory structure match against those in the blacklist
Philosophy of this design
  ◦ H3 (directory structure similarity)
  ◦ H4 (query string substitution)
n_i: the number of URLs corresponding to a directory structure
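A minimal sketch of the M3 lookup, assuming an exact match on the directory part of the path and ignoring the root directory, which carries no structure. The function names and sample URLs are hypothetical, and the sketch returns the raw count n_i only; how that count is normalized is not shown on this slide.

```python
from collections import Counter
from posixpath import dirname
from urllib.parse import urlsplit


def directory_of(url):
    """Return the directory part of a URL path, e.g. '/online/signin'."""
    return dirname(urlsplit(url).path)


def build_directory_index(blacklist):
    """Count how many blacklisted URLs use each directory structure (n_i)."""
    return Counter(directory_of(u) for u in blacklist)


def m3_count(url, index):
    """Return n_i for the incoming URL's directory, or 0 if unseen or root."""
    directory = directory_of(url)
    if directory in ("", "/"):
        return 0
    return index.get(directory, 0)
```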
17 M4: Matching Brand Names
Check for the existence of brand names in the pathname and query string of URLs
n_i: the number of occurrences of the brand name
Compute a final cumulative score
  ◦ Assign different weights to different modules
18 Evaluation: Component 1
Collect 6,000 URLs from PhishTank (2009/7/2 ~ 2009/7/25)
19 Evaluation: Component 2
Measure how many benign sites are flagged as malicious (false positives) and how many malicious sites are not flagged (false negatives)
Data source
  ◦ Phishing URLs
    PhishTank (about 18,000 URLs)
    SpamScatter (14,000 URLs)
  ◦ Benign URLs
    DMOZ (100,000 benign URLs)
    Yahoo Random URL Generator (YRUG) (20,000 benign URLs)
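The two error measures can be computed as below. This is a generic sketch with toy inputs; the function name is hypothetical and the numbers in the usage are not results from the evaluation.

```python
def error_rates(benign_flags, phishing_flags):
    """Return (false_positive_rate, false_negative_rate).

    benign_flags:   one bool per benign URL, True if it was flagged as phishing.
    phishing_flags: one bool per phishing URL, True if it was flagged as phishing.
    """
    false_positive_rate = sum(benign_flags) / len(benign_flags)
    false_negative_rate = sum(not f for f in phishing_flags) / len(phishing_flags)
    return false_positive_rate, false_negative_rate
```

For example, one wrongly flagged URL out of four benign ones and one missed URL out of four phishing ones gives a rate of 0.25 for each.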
20 Evaluation: Component 2 (cont’d)
Training phase
  ◦ Create various data structures using the phishing URLs
Testing phase
  ◦ An input URL is flagged as either a phishing or a benign site
Weight of individual modules
  ◦ W(M1, M2, M3, M4) = (1.0, 1.0, 1.5, 1.5)
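A minimal sketch of how the module weights could be combined, assuming the cumulative score is a weighted sum of the per-module scores and that a URL is flagged when the sum reaches a threshold. The slides give the weights but not the combination rule or the threshold, so both are assumptions, and the function names are hypothetical.

```python
# Weights from the slide: W(M1, M2, M3, M4) = (1.0, 1.0, 1.5, 1.5)
WEIGHTS = {"M1": 1.0, "M2": 1.0, "M3": 1.5, "M4": 1.5}


def cumulative_score(module_scores, weights=WEIGHTS):
    """Weighted sum of per-module scores; missing modules contribute 0."""
    return sum(weights[m] * module_scores.get(m, 0.0) for m in weights)


def classify(module_scores, threshold):
    """Flag as phishing when the cumulative score reaches the threshold.

    The threshold is a caller-supplied assumption; the slides do not state one.
    """
    return "phishing" if cumulative_score(module_scores) >= threshold else "benign"
```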
21 Evaluation: Component 2 (cont’d)
22 Conclusion
PhishNet addresses major problems associated with blacklists
Two major components of PhishNet
  ◦ URL prediction component
  ◦ Approximate URL matching component
Flags new URLs effectively