Published by Margaret Hunt. Modified over 6 years ago.
1
Detecting Algorithmically Generated Domains Using Data Visualizations and N-grams Methods
Authors: Tianyu Wang and Li-Chiou Chen
Presenter: Tianyu Wang
2
Contents
- Introduction
- Data Sets
- Classification Features
- Model Improvement
- Result Analysis
- Conclusion & Future Research
3
Introduction
Example: 1002n0q11m17h017r1shexghfqf.com
A Domain Generation Algorithm (DGA) can:
- Generate a long list of domain names
- Keep sending name-resolution requests
- Evade blacklist-based detection
Identifying DGA domains would:
- Help detect distributed command-and-control botnets
- Help monitor potentially malicious activity
4
Dataset

               Legit Domain    DGA Domain
Data Source    Alexa Top 1M    ClickSecurity Project
Sample Size    1,000,000       52,665

Data Processing
- Remove the top-level domain (TLD), for example google.com -> google

Clean Data
Legit domain    DGA domain
google          1002n0q11m17h017r1shexghfqf
facebook        1002ra86698fjpgqke1cdvbk5
youtube         1008bnt1iekzdt1fqjb76pijxhr
yahoo           100f3a11ckgv438fpjz91idu2ag
baidu           100fjpj1yk5l751n4g9p01bgkmaf
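The data-processing step can be sketched as follows. This is a minimal illustration, not the presenters' code: the helper name `strip_tld` is hypothetical, and it assumes each input is a bare registered domain with no subdomain, so the first label is the part to keep.

```python
def strip_tld(domain: str) -> str:
    """Return the label left of the first dot, lower-cased (google.com -> google).

    Simplifying assumption: the input has no subdomain, so everything after
    the first dot is treated as the top-level domain and discarded.
    """
    return domain.strip().lower().split(".", 1)[0]


samples = ["google.com", "Facebook.com", "1002n0q11m17h017r1shexghfqf.com"]
cleaned = [strip_tld(d) for d in samples]
print(cleaned)
```

Multi-part suffixes such as .co.uk are also dropped by this rule, but a name with a subdomain (www.example.com) would need a public-suffix-aware parser instead.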
5
Supervised Classification
Supervised learning features:
- Length of domain
- Entropy of domain: measures the randomness of the information
$$H(X) = \sum_{i=1}^{n} P(x_i)\, I(x_i) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$$
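The two features can be computed with a few lines of standard-library Python. This is a sketch under one assumption: the logarithm base b is taken to be 2, which reproduces entropy values shown elsewhere in the deck (for example 2.807 for infiniteskills). The function name `shannon_entropy` is illustrative only.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """H(X) = -sum P(x_i) * log2 P(x_i), taken over the characters of text."""
    if not text:
        return 0.0
    n = len(text)
    # Counter gives the frequency of each distinct character.
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


for name in ("google", "infiniteskills", "wdraokbcnspexm"):
    print(name, len(name), round(shannon_entropy(name), 3))
```

A string of all-distinct characters reaches the maximum entropy for its length, which is why random-looking DGA names tend to score higher than dictionary-like names of similar length.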
6
Data frame Sample

domain                                       class   length   entropy
theukwebdesigncompany                        legit   21       -
texaswithlove1982-amomentlikethis            legit   33       -
congresomundialjjrperu2009                   legit   26       -
a17btkyb38gxe41pwd50nxmzjxiwjwdwfrp52        dga     37       -
a17c49l68ntkqnuhvkrmyb28fubvn30e31g43dq      dga     39       -
a17d60gtnxk47gskti15izhvlviyksh64nqkz        dga     37       -
a17erpzfzh64c69csi35bqgvp52drita67jzmy       dga     38       -
a17fro51oyk67b18ksfzoti55j36p32o11fvc29cr    dga     41       -
7
Data Visualization - Plotting
8
Prepare for Classification
Re-sampling
- Shuffle data randomly for training/testing (80/20 split)
Choose classification algorithms
- Random Forest
- Support Vector Machines (SVM)
- Naïve Bayes
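The re-sampling step might look like the sketch below. The helper `shuffle_split`, the fixed seed, and the toy rows are assumptions for illustration; the deck does not say which tool was used for the split.

```python
import random


def shuffle_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows, then cut it into (train, test) parts."""
    shuffled = list(rows)                   # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded for repeatable splits
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = [(f"domain{i}", "legit" if i % 2 else "dga") for i in range(10)]
train, test = shuffle_split(rows)
print(len(train), len(test))
```

Shuffling before the cut matters here because the raw data is ordered by class; an unshuffled 80/20 cut would leave one class almost absent from the test set.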
9
Random Forest

Confusion matrix (rows: true class, columns: predicted class)
True \ Predicted    dga     legit     All
dga                 2991    6379      9370
legit               427     127532    127959
All                 3418    133911    137329

- True Positive Rate (TPR) = 31.92%
- True Negative Rate (TNR) = 99.67%
- False Negative Rate (FNR) = 68.08%
- False Positive Rate (FPR) = 0.33%
- False Acceptance Rate (FAR) = 4.76%
- False Rejection Rate (FRR) = 12.49%
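The rates can be reproduced from the confusion-matrix counts. In this sketch, "dga" is treated as the positive class. The FAR and FRR formulas are inferred from the reported figures rather than stated on the slide: FAR = false negatives / all predicted legit, FRR = false positives / all predicted dga. The function name `rates` is hypothetical.

```python
def rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute the six rates from confusion-matrix counts (positive = dga)."""
    return {
        "TPR": tp / (tp + fn),   # dga correctly flagged
        "TNR": tn / (tn + fp),   # legit correctly passed
        "FNR": fn / (tp + fn),   # dga missed
        "FPR": fp / (tn + fp),   # legit wrongly flagged
        "FAR": fn / (fn + tn),   # assumed: share of predicted-legit that is dga
        "FRR": fp / (tp + fp),   # assumed: share of predicted-dga that is legit
    }


# Random Forest counts from the confusion matrix in the deck.
rf = rates(tp=2991, fn=6379, fp=427, tn=127532)
for name, value in rf.items():
    print(f"{name}: {value:.2%}")
```

With these definitions the Random Forest counts give the same six percentages as listed, which supports the inferred FAR and FRR formulas without confirming that they are the presenters' definitions.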
10
SVM

Confusion matrix (rows: true class, columns: predicted class)
True \ Predicted    dga     legit     All
dga                 1160    8210      9370
legit               105     127854    127959
All                 1265    136064    137329

- TPR = 12.38%
- TNR = 99.92%
- FNR = 87.62%
- FPR = 0.08%
- FAR = 6.03%
- FRR = 8.30%
11
Naïve Bayes

Confusion matrix (rows: true class, columns: predicted class)
True \ Predicted    dga     legit     All
dga                 3332    6038      9370
legit               5061    122898    127959
All                 8393    128936    137329

- TPR = 35.56%
- TNR = 96.04%
- FNR = 64.44%
- FPR = 3.96%
- FAR = 4.68%
- FRR = 60.30%
12
Result Comparisons

Rate    Random Forest    SVM       Naïve Bayes
TPR     31.92%           12.38%    35.56%
TNR     99.67%           99.92%    96.04%
FNR     68.08%           87.62%    64.44%
FPR     0.33%            0.08%     3.96%
FAR     4.76%            6.03%     4.68%
FRR     12.49%           8.30%     60.30%

(Labels on slide: Accuracy Rate, Error Rate)
All three classifiers miss most DGA domains (TPR of at most 35.56%), so we need to improve our model.
13
Improvement Insight
- Many DGAs are dictionary-based algorithms
- How can we measure the similarity among domains?
- Introduce new features based on n-grams
- Build up text corpus matrices
  - Legit domain matrix
  - Dictionary words matrix
- Calculate a similarity score based on each matrix
14
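Similarity Score using N-Gram
N-grams of domains [N = 3, 4, 5]
Matrix $L_{m \times n}$: one row per domain in $D_{1 \times m}$ (for example, google) and one column per gram (abc, ego, goo, oog, ogl, gle, goog, oogl, ogle, …); $S_{1 \times n}$ holds the column sums.
To calculate the similarity score:
- Take all legit domains, $D_{1 \times m}$
- Build the n-gram matrix of all the legit domains, $L_{m \times n}$
- Sum up the frequency of each gram, $S_{1 \times n} = \{S_1, S_2, \ldots, S_n\}$
- For each domain, build its gram frequency array over the same grams as the Alexa domain matrix, $F_{1 \times n}$
- Calculate the match value, $M = S_{1 \times n} \times [F_{1 \times n}]^{T}$
- Normalize the match value to the similarity score, $\mathit{score} = \log_{10} M$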
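The steps above can be sketched with standard-library counters standing in for the matrices. This follows the steps only as they are written on the slide; the function names, the tiny corpus, and the choice to return 0.0 when no gram matches (where the logarithm is undefined) are assumptions for illustration.

```python
import math
from collections import Counter


def char_ngrams(domain, sizes=(3, 4, 5)):
    """Return every character n-gram of the given sizes, shortest first."""
    return [domain[i:i + n] for n in sizes for i in range(len(domain) - n + 1)]


def gram_totals(corpus, sizes=(3, 4, 5)):
    """S: total frequency of each gram over the corpus (column sums of L)."""
    totals = Counter()
    for domain in corpus:
        totals.update(char_ngrams(domain, sizes))
    return totals


def similarity_score(domain, totals, sizes=(3, 4, 5)):
    """score = log10(S x F^T), with F the gram frequencies of this domain."""
    freq = Counter(char_ngrams(domain, sizes))
    # A Counter returns 0 for grams never seen in the corpus.
    match = sum(totals[gram] * count for gram, count in freq.items())
    return math.log10(match) if match > 0 else 0.0   # assumed fallback


legit = ["google", "googlemail", "facebook", "youtube"]   # toy corpus
totals = gram_totals(legit)
print(char_ngrams("google"))
print(similarity_score("google", totals), similarity_score("wdqxkzvbn", totals))
```

A domain that shares many grams with the legit corpus gets a positive score, while one that shares none gets the fallback value. The scores tabulated in the deck are in the tens, so the presenters' scaling may differ from a single final logarithm; this sketch does not attempt to reproduce those numbers.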
15
Alexa Gram
We calculate an alexa_grams score for every domain.

domain                   class   length   entropy   alexa_grams
investmentsonthebeach    legit   21       3.368     -
infiniteskills           legit   14       2.807     81.379
dticash                  legit   7        2.807     26.558
healthyliving            legit   13       3.239     76.710
asset-cache              legit   11       2.732     46.268
wdqdreklqnpp             dga     12       3.085     11.242
wdqjkpltirjhtho          dga     15       3.507     14.304
wdqxavemaedon            dga     13       3.239     28.468
wdraokbcnspexm           dga     14       3.807     25.935
wdsqfivqnqcbna           dga     14       3.325     4.598
16
Dictionary Gram
Similarly, we calculate a dict_grams score for every domain.
Dictionary: 479,623 commonly used English terms

domain                   class   length   entropy   dict_grams
investmentsonthebeach    legit   21       3.368     -
infiniteskills           legit   14       2.807     72.786
dticash                  legit   7        2.807     23.710
healthyliving            legit   13       3.239     61.722
asset-cache              legit   11       2.732     31.691
wdqdreklqnpp             dga     12       3.085     6.367
wdqjkpltirjhtho          dga     15       3.507     16.554
wdqxavemaedon            dga     13       3.239     28.700
wdraokbcnspexm           dga     14       3.807     19.785
wdsqfivqnqcbna           dga     14       3.325     3.629
17
New Data Frame Sample
New features:
- Calculate n-grams against the Alexa domains: alexa_grams
- Calculate n-grams against the dictionary: dict_grams

domain                   class   length   entropy   alexa_grams   dict_grams
investmentsonthebeach    legit   21       3.368     -             -
infiniteskills           legit   14       2.807     81.379        72.786
dticash                  legit   7        2.807     26.558        23.710
healthyliving            legit   13       3.239     76.710        61.722
asset-cache              legit   11       2.732     46.268        31.691
wdqdreklqnpp             dga     12       3.085     11.242        6.367
wdqjkpltirjhtho          dga     15       3.507     14.304        16.554
wdqxavemaedon            dga     13       3.239     28.468        28.700
wdraokbcnspexm           dga     14       3.807     25.935        19.785
wdsqfivqnqcbna           dga     14       3.325     4.598         3.629
18
More Plot
19
More Plot
20
More Plot
21
More Plot
22
New Features in Classification
- Keep the same classification parameters
- Train the model with four features
  - Length
  - Entropy
  - Alexa_grams
  - Dict_grams
23
Model Improvement Compare

        Old                                      New
Rate    Random Forest   SVM       Naïve Bayes    Random Forest   SVM       Naïve Bayes
TPR     31.92%          12.38%    35.56%         97.53%          92.03%    76.87%
TNR     99.67%          99.92%    96.04%         99.80%          99.58%    99.72%
FNR     68.08%          87.62%    64.44%         2.47%           7.97%     23.13%
FPR     0.33%           0.08%     3.96%          0.20%           0.42%     0.28%
FAR     4.76%           6.03%     4.68%          0.18%           0.58%     1.67%
FRR     12.49%          8.30%     60.30%         2.70%           5.83%     -

(Labels on slide: Accuracy Rate, Error Rate)
24
Detail Compare - Random Forest
The new model is better: TPR rises from 31.92% to 97.53% and FNR falls from 68.08% to 2.47%.
25
Detail Compare - SVM
The new model is better: TPR rises from 12.38% to 92.03% and FNR falls from 87.62% to 7.97%.
26
Detail Compare - Naïve Bayes
The new model is better: TPR rises from 35.56% to 76.87% and FNR falls from 64.44% to 23.13%.
27
Conclusion
- Applied machine learning to a cybersecurity problem
- Introduced two new features for DGA domain classification
- Identified DGA domains successfully
- Compared the performance of three algorithms on the new features
- Further research will focus on real-time monitoring
28
Q & A Thank You