Detecting Algorithmically Generated Domains Using Data Visualizations and N-grams Methods
Authors: Tianyu Wang and Li-Chiou Chen
Presenter: Tianyu Wang
Contents
- Introduction
- Data Sets
- Classification Features
- Model Improvement
- Result Analysis
- Conclusion & Future Research
Introduction
Example: 1002n0q11m17h017r1shexghfqf.com
A Domain Generation Algorithm (DGA) can:
- Generate a long list of domain names
- Keep sending name resolution requests
- Evade blacklist-based detection
Identifying DGA domains would:
- Help detect distributed Command & Control (C&C) botnets
- Help monitor potential malicious activities
ClickSecurity Project Dataset

                   Legit Domains    DGA Domains
    Data Source    Alexa Top 1M     ClickSecurity Project
    Sample Size    1,000,000        52,665

Data Processing
- Remove the Top-Level Domain (TLD), e.g., google.com -> google

Clean Data Sample

    Legit domain    DGA domain
    google          1002n0q11m17h017r1shexghfqf
    facebook        1002ra86698fjpgqke1cdvbk5
    youtube         1008bnt1iekzdt1fqjb76pijxhr
    yahoo           100f3a11ckgv438fpjz91idu2ag
    baidu           100fjpj1yk5l751n4g9p01bgkmaf
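A minimal sketch of the TLD-stripping step in Python, assuming the tldextract package (which handles multi-part suffixes such as .co.uk that a naive split on "." would miss):

    import tldextract

    def strip_tld(fqdn):
        # Keep only the registered-domain label, e.g. "google.com" -> "google"
        return tldextract.extract(fqdn).domain

    print(strip_tld("google.com"))   # -> google
    print(strip_tld("bbc.co.uk"))    # -> bbc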
Supervised Classification
Supervised Learning Features:
- Length of the domain
- Entropy of the domain (measures the randomness of the string)

    H(X) = \sum_{i=1}^{n} P(x_i) I(x_i) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)
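A minimal sketch of the entropy feature, using character frequencies within the domain string and base-2 logarithms (the base b is a free choice; the function name is illustrative):

    import math
    from collections import Counter

    def shannon_entropy(domain):
        # P(x_i): relative frequency of each character in the string
        n = len(domain)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(domain).values())

    print(shannon_entropy("google"))                       # ~1.92, repetitive
    print(shannon_entropy("1002n0q11m17h017r1shexghfqf"))  # higher, more random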
Data Frame Sample

    domain                                      class   length   entropy
    theukwebdesigncompany                       legit   21       4.070656
    texaswithlove1982-amomentlikethis           legit   33       4.051822
    congresomundialjjrperu2009                  legit   26       4.056021
    a17btkyb38gxe41pwd50nxmzjxiwjwdwfrp52       dga     37       4.540402
    a17c49l68ntkqnuhvkrmyb28fubvn30e31g43dq     dga     39       4.631305
    a17d60gtnxk47gskti15izhvlviyksh64nqkz       dga              4.270132
    a17erpzfzh64c69csi35bqgvp52drita67jzmy      dga     38       4.629249
    a17fro51oyk67b18ksfzoti55j36p32o11fvc29cr   dga     41       4.305859
Data Visualization - Plotting
Prepare for Classification
- Re-sampling: shuffle data randomly for training/testing (80/20 split)
- Choose classification algorithms (see the sketch after this list):
  - Random Forest
  - Support Vector Machines (SVM)
  - Naïve Bayes
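A minimal sketch of this setup with scikit-learn, assuming a pandas DataFrame df shaped like the data-frame sample above (the variable names are illustrative):

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB

    X = df[["length", "entropy"]]   # the two features so far
    y = df["class"]                 # "legit" or "dga"

    # Shuffled 80/20 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)

    for clf in (RandomForestClassifier(), SVC(), GaussianNB()):
        clf.fit(X_train, y_train)
        print(type(clf).__name__, clf.score(X_test, y_test))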
Random Forest

    True \ Predicted   dga    legit    All
    dga                2991   6379     9370
    legit              427    127532   127959
    All                3418   133911   137329

    True Positive Rate (TPR)    = 31.92%
    True Negative Rate (TNR)    = 99.67%
    False Negative Rate (FNR)   = 68.08%
    False Positive Rate (FPR)   = 0.33%
    False Acceptance Rate (FAR) = 4.76%
    False Rejection Rate (FRR)  = 12.49%
SVM

    True \ Predicted   dga    legit    All
    dga                1160   8210     9370
    legit              105    127854   127959
    All                1265   136064   137329

    TPR = 12.38%
    TNR = 99.92%
    FNR = 87.62%
    FPR = 0.08%
    FAR = 6.03%
    FRR = 8.30%
Naïve Bayes

    True \ Predicted   dga    legit    All
    dga                3332   6038     9370
    legit              5061   122898   127959
    All                8393   128936   137329

    TPR = 35.56%
    TNR = 96.04%
    FNR = 64.44%
    FPR = 3.96%
    FAR = 4.68%
    FRR = 60.30%
Result Comparisons

    Rate   Random Forest   SVM      Naïve Bayes
    TPR    31.92%          12.38%   35.56%
    TNR    99.67%          99.92%   96.04%
    FNR    68.08%          87.62%   64.44%
    FPR    0.33%           0.08%    3.96%
    FAR    4.76%           6.03%    4.68%
    FRR    12.49%          8.30%    60.30%

[Charts: accuracy rate and error rate by algorithm]

The true positive rates are low across all three algorithms; we need to improve our model.
Improvement Insight
- Many DGAs are dictionary-based algorithms
- How do we measure the similarity among domains?
- Introduce new features based on N-grams:
  - Build text corpus matrices: a legit-domain matrix and a dictionary-words matrix
  - Calculate a similarity score based on each matrix
Similarity Score Using N-Grams
N-grams of domains (N = 3, 4, 5), e.g., "google" -> {goo, oog, ogl, gle, goog, oogl, ogle, ...}

To calculate the similarity score (see the sketch below):
- Take all legit domains, D_{1×m}
- Build the N-gram matrix of all the legit domains, L_{m×n}
- Sum up the frequency of each gram, S_{1×n} = {S_1, S_2, ..., S_n}
- Count the gram frequencies of each domain to be scored, over the same vocabulary, F_{1×n}
- Calculate the match value M = S_{1×n} × [F_{1×n}]^T
- Normalize the match value to the similarity score, Score = \log_{10} M
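A minimal sketch of the scoring pipeline, assuming scikit-learn's CountVectorizer for the character n-grams (names like legit_domains and similarity_score are illustrative):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    legit_domains = ["google", "facebook", "youtube", "yahoo", "baidu"]

    # L (m x n): rows are legit domains, columns are their 3/4/5-grams
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 5))
    L = vectorizer.fit_transform(legit_domains)

    # S (1 x n): total frequency of each gram across the legit corpus
    S = np.asarray(L.sum(axis=0)).ravel()

    def similarity_score(domain):
        # F (1 x n): gram counts of the candidate over the same vocabulary
        F = np.asarray(vectorizer.transform([domain]).todense()).ravel()
        M = S @ F                            # match value M = S x F^T
        return np.log10(M) if M > 0 else 0.0 # normalize to the score

    print(similarity_score("googles"))       # high: shares grams with the corpus
    print(similarity_score("1002n0q11m17"))  # low: random-looking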
Alexa Grams
We calculate the alexa_grams score for every domain.

    domain                  class   length   entropy   alexa_grams
    investmentsonthebeach   legit   21       3.368     144.722
    infiniteskills          legit   14       2.807     81.379
    dticash                 legit   7                  26.558
    healthyliving           legit   13       3.239     76.710
    asset-cache             legit   11       2.732     46.268
    wdqdreklqnpp            dga     12       3.085     11.242
    wdqjkpltirjhtho         dga     15       3.507     14.304
    wdqxavemaedon           dga                        28.468
    wdraokbcnspexm          dga              3.807     25.935
    wdsqfivqnqcbna          dga              3.325     4.598
Dictionary Grams
Similarly, we calculate the dict_grams score for every domain.
Dictionary: 479,623 commonly used English words

    domain                  class   length   entropy   dict_grams
    investmentsonthebeach   legit   21       3.368     109.723
    infiniteskills          legit   14       2.807     72.786
    dticash                 legit   7                  23.710
    healthyliving           legit   13       3.239     61.722
    asset-cache             legit   11       2.732     31.691
    wdqdreklqnpp            dga     12       3.085     6.367
    wdqjkpltirjhtho         dga     15       3.507     16.554
    wdqxavemaedon           dga                        28.700
    wdraokbcnspexm          dga              3.807     19.785
    wdsqfivqnqcbna          dga              3.325     3.629
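dict_grams uses the same pipeline as alexa_grams, fit on the dictionary word list instead of the legit-domain corpus. A sketch, assuming a plain-text word list at a hypothetical path words.txt:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical path to the 479,623-word dictionary, one word per line
    with open("words.txt") as f:
        words = [w.strip().lower() for w in f if w.strip()]

    dict_vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 5))
    D = dict_vectorizer.fit_transform(words)

    # Gram frequencies over the dictionary; score domains exactly as in
    # similarity_score above, using dict_vectorizer and S_dict
    S_dict = np.asarray(D.sum(axis=0)).ravel()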
New Data Frame Sample
New Features:
- alexa_grams: N-gram similarity score against the Alexa domain corpus
- dict_grams: N-gram similarity score against the dictionary

    domain                  class   length   entropy   alexa_grams   dict_grams
    investmentsonthebeach   legit   21       3.368     144.722       109.723
    infiniteskills          legit   14       2.807     81.379        72.786
    dticash                 legit   7                  26.558        23.710
    healthyliving           legit   13       3.239     76.710        61.722
    asset-cache             legit   11       2.732     46.268        31.691
    wdqdreklqnpp            dga     12       3.085     11.242        6.367
    wdqjkpltirjhtho         dga     15       3.507     14.304        16.554
    wdqxavemaedon           dga                        28.468        28.700
    wdraokbcnspexm          dga              3.807     25.935        19.785
    wdsqfivqnqcbna          dga              3.325     4.598         3.629
More Plots (visualizations of the new alexa_grams and dict_grams features)
New Features in Classification
- Keep the same classification parameters
- Train the models with four features (see the sketch after this list):
  - length
  - entropy
  - alexa_grams
  - dict_grams
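Retraining is the same loop as before, widened to the four-feature matrix (a sketch; df is assumed to carry the new columns):

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB

    X = df[["length", "entropy", "alexa_grams", "dict_grams"]]
    y = df["class"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=42)

    for clf in (RandomForestClassifier(), SVC(), GaussianNB()):
        clf.fit(X_train, y_train)
        print(type(clf).__name__, clf.score(X_test, y_test))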
Model Improvement: Old vs. New Performance

    Rate   RF (old)   RF (new)   SVM (old)   SVM (new)   NB (old)   NB (new)
    TPR    31.92%     97.53%     12.38%      92.03%      35.56%     76.87%
    TNR    99.67%     99.80%     99.92%      99.58%      96.04%     99.72%
    FNR    68.08%     2.47%      87.62%      7.97%       64.44%     23.13%
    FPR    0.33%      0.20%      0.08%       0.42%       3.96%      0.28%
    FAR    4.76%      0.18%      6.03%       0.58%       4.68%      1.67%
    FRR    12.49%     2.70%      8.30%       5.83%       60.30%

[Charts: accuracy rate and error rate by algorithm]
Detailed Comparison – Random Forest: the new model performs better.
Detailed Comparison – SVM: the new model performs better.
Detailed Comparison – Naïve Bayes: the new model performs better.
Conclusion
- Implemented machine learning methods in the cybersecurity domain
- Introduced two new features for DGA domain classification
- Identified DGA domains successfully
- Compared the performance of three algorithms on the new features
- Future research will focus on real-time monitoring
Q & A Thank You