Published by Linda Jefferson
1
Reporter: Li, Fong Ruei, National Taiwan University of Science and Technology. 9/19/2015, Slide 1 (of 32)
2
Large-Scale Automatic Classification of Phishing Pages. Colin Whittaker, Brian Ryner, Marria Nazif. NDSS '10, 2010.
3
▪ Introduction
▪ Phishing Classifier Infrastructure
▪ Evaluation
▪ Conclusion
4
Phishing is a form of identity theft that combines social engineering techniques and sophisticated attack vectors to harvest financial information from unsuspecting consumers. Often a phisher tries to lure her victim into clicking a URL pointing to a rogue page.
5
Overall System Design
Our system classifies web pages submitted by end users and collected from Gmail’s spam filters. The features it uses describe the composition of:
▪ the web page’s URL
▪ the hosting of the page
▪ the page’s HTML content as collected by a crawler
6
Classification Workflow
▪ The first process extracts features about the URL of the page.
▪ The second process obtains domain information about the page and crawls it.
▪ The final process assigns the page a score, based on the collected features, representing the probability that the page is phishing.
7
Candidate URL Collection
We receive new potential phishing URLs in reports:
▪ from users of our blacklist
▪ from spam messages collected by Gmail
8
URL Feature Extraction
The first process in the workflow, the URL Feature Extractor, looks only at the URL of the page to determine features. If the URL matches a whitelist of high-profile, safe sites, the URL Feature Extractor drops it from the workflow entirely. We manually compiled this whitelist of 2778 sites.
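The whitelist check can be sketched roughly as follows. This is an illustration only, not the authors' code: the `WHITELIST` entries and the name `is_whitelisted` are hypothetical, and treating a URL as a match when its hostname or any parent domain is listed is an assumption, since the slides do not define "matches".

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the manually compiled whitelist of safe sites.
WHITELIST = {"example.com", "example.org"}

def is_whitelisted(url, whitelist=WHITELIST):
    """Return True if the URL's host, or any parent domain of it, is whitelisted."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check "a.b.example.com", then "b.example.com", then "example.com", ...
    return any(".".join(labels[i:]) in whitelist for i in range(len(labels)))
```

Matching on domain suffixes rather than substrings means a host such as `example.com.evil.test` is not mistaken for the whitelisted `example.com`.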
9
URL Feature Extraction
One feature this process extracts is whether the URL uses an IP address as its hostname.
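A minimal sketch of the IP-address feature, assuming Python's standard `urllib.parse` and `ipaddress` modules. The function name `host_is_ip_address` is hypothetical, and the slides do not say how the real extractor parses hostnames.

```python
import ipaddress
from urllib.parse import urlsplit

def host_is_ip_address(url):
    """Return True if the URL's hostname is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        # Accepts dotted-quad IPv4 and IPv6 literals. Obfuscated forms such as
        # a single decimal integer are NOT detected by this simple check.
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False
```

For example, `host_is_ip_address("http://192.0.2.7/login.html")` is true, while a URL with an ordinary domain name yields false.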
10
URL Feature Extraction
▪ Another feature this process extracts is whether the page has many host components.
▪ Phishers commonly use a long hostname, prepending an authentic-sounding host to their fixed domain name, to confuse viewers into believing that the page is legitimate.
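The host-component feature might look like the sketch below. The cutoff `MANY_COMPONENTS_THRESHOLD` and both function names are hypothetical; the slides do not state how many components count as "many".

```python
from urllib.parse import urlsplit

# Hypothetical cutoff; the slides do not give the real value.
MANY_COMPONENTS_THRESHOLD = 5

def host_component_count(url):
    """Count the dot-separated labels in the URL's hostname."""
    host = (urlsplit(url).hostname or "").strip(".")
    return len(host.split(".")) if host else 0

def has_many_host_components(url, threshold=MANY_COMPONENTS_THRESHOLD):
    """Boolean feature: does the hostname have more labels than the threshold?"""
    return host_component_count(url) > threshold
```

A hostname such as `www.bank.example.com.secure.login.attacker.test` has eight components and would trigger the feature under this assumed threshold, while `www.example.com` has three and would not.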
11
URL Feature Extraction
Phishers often include characteristic strings in their URLs to mislead viewers. These can include the trademarks of the phishing target, like “abbeynational” in the example above, or more general phrases associated with phishing targets, like “login”. The feature extractor transforms each of these tokens into a boolean feature, such as “The path contains the token ‘login’.”
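One way to turn URL tokens into boolean features is sketched here. The token list `TOKENS`, the feature-name format, and the choice to split on non-alphanumeric characters are all assumptions made for illustration; the slides only say that each token becomes a boolean feature.

```python
import re
from urllib.parse import urlsplit

# Hypothetical token list, taken from the two examples named in the slide.
TOKENS = ("login", "abbeynational")

def url_token_features(url, tokens=TOKENS):
    """Map each token to boolean features for the host and the path."""
    parts = urlsplit(url.lower())
    # Split on any run of characters that is not a lowercase letter or digit.
    host_tokens = set(re.split(r"[^a-z0-9]+", parts.hostname or ""))
    path_tokens = set(re.split(r"[^a-z0-9]+", parts.path))
    features = {}
    for token in tokens:
        features[f"host_contains:{token}"] = token in host_tokens
        features[f"path_contains:{token}"] = token in path_tokens
    return features
```

Keeping host and path features separate mirrors the slide's example feature, which refers specifically to the path.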
12
Fetching Page Content
The URL Feature Extractor also collects URL metadata, including PageRank, from Google proprietary infrastructure. We also use a domain reputation score computed by the Gmail anti-spam system as a feature.
▪ This score is roughly the percentage of emails from a domain which are not spam.
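The rough definition of the reputation score translates into simple arithmetic. The sketch below only illustrates that definition; how Gmail actually computes the score is not described, and the function name and the handling of domains with no observed mail are assumptions.

```python
def domain_reputation(non_spam_count, total_count):
    """Percentage of a domain's emails that are not spam, or None if unknown."""
    if total_count == 0:
        return None  # assumption: no mail observed means no reputation score
    return 100.0 * non_spam_count / total_count
```

Under this definition, a domain with 90 non-spam messages out of 100 would score 90.0.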
13
Hosting and Page Feature Extraction
The Content Fetcher process crawls the page and gathers its hosting information.
▪ It records the returned IPs, name servers, and name server IPs.
▪ It also geolocates these IPs, recording the city, region, and country.
14
Hosting and Page Feature Extraction
The Content Fetcher sends the URL to a pool of headless web browsers to render the page content. After the browser renders the page, the Content Fetcher receives and records the page HTML, as well as all iframe, image, and JavaScript content embedded in the page.
15
Page Classification
To compute the score for the page in log odds, the classifier combines the feature values using logistic regression. The score translates to the computed probability that the page is phishing.
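The relationship between a log-odds score and a probability can be sketched as below, using the standard logistic regression form: a linear combination of feature values passed through the logistic function. The function names, feature names, and weights are hypothetical; the real model's features and learned weights are not given in the slides.

```python
import math

def log_odds(features, weights, bias=0.0):
    """Linear score: bias plus the sum of weight * value over the page's features."""
    # Features without a learned weight contribute nothing to the score.
    return bias + sum(weights.get(name, 0.0) * value
                      for name, value in features.items())

def phishing_probability(score):
    """Map a log-odds score to a probability with the logistic function."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)
```

A score of 0 in log odds corresponds to a probability of 0.5; positive scores push the probability toward 1 and negative scores toward 0.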
16
Page Classification
Before the classifier automatically blacklists the page, it checks to make sure that the page does not have a high PageRank.
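The PageRank safeguard amounts to a second condition on the blacklisting decision. In the sketch below, both thresholds, the 0-to-1 PageRank scale, and the function name are hypothetical, since the slides give no concrete values.

```python
# Hypothetical thresholds; the slides do not state the real values or scales.
SCORE_THRESHOLD = 0.5
HIGH_PAGERANK = 0.8

def should_blacklist(phishing_prob, pagerank,
                     score_threshold=SCORE_THRESHOLD,
                     high_pagerank=HIGH_PAGERANK):
    """Blacklist only pages scored as phishing that lack a high PageRank."""
    if phishing_prob < score_threshold:
        return False
    # A high PageRank suggests an established page, so skip automatic blacklisting.
    return pagerank < high_pagerank
```

The check acts as a guard against false positives on popular pages: a page with a high phishing score but a high PageRank is not blacklisted automatically.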
17
Evaluation Dataset
First dataset:
▪ contains data collected between April 16, 2009 and July 14, 2009, with labels from July 15, 2009
▪ used to examine our selected features and train our evaluation models
Second dataset:
▪ collected during the first two weeks of August 2009, as a validation dataset
31
We describe our large-scale system for automatically classifying phishing pages, which maintains a false positive rate below 0.1%. Our classification system examines millions of potential phishing pages daily in a fraction of the time of a manual review process.