Detecting Algorithmically Generated Domains Using Data Visualizations and N-grams Methods
Author: Tianyu Wang and Li-Chiou Chen
Presenter: Tianyu Wang

Contents
- Introduction
- Data Sets
- Classification Features
- Model Improvement
- Result Analysis
- Conclusion & Future Research

Introduction
Example: 1002n0q11m17h017r1shexghfqf.com
A Domain Generation Algorithm (DGA) can:
- Generate a long list of domain names
- Keep sending name-resolution requests
- Evade blacklist-based detection
Identifying DGA domains would:
- Help detect distributed Command & Control botnets
- Help monitor potential malicious activities

ClickSecurity Project Dataset

               Legit Domain    DGA Domain
Data Source    Alexa Top 1M    ClickSecurity Project
Sample Size    1,000,000       52,665

Data Processing: remove the top-level domain (TLD); for example, google.com -> google.

Clean data sample:

Legit domain    DGA domain
google          1002n0q11m17h017r1shexghfqf
facebook        1002ra86698fjpgqke1cdvbk5
youtube         1008bnt1iekzdt1fqjb76pijxhr
yahoo           100f3a11ckgv438fpjz91idu2ag
baidu           100fjpj1yk5l751n4g9p01bgkmaf
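
For illustration, the TLD-stripping step might look like the sketch below; the tldextract package is an assumption here, since the slides do not name a tool.

```python
# Hypothetical illustration of the data-processing step: keep only the
# registered name, dropping the TLD (google.com -> google).
import tldextract

def strip_tld(fqdn):
    """Return the registered domain name without its TLD."""
    return tldextract.extract(fqdn).domain

print(strip_tld("google.com"))                       # google
print(strip_tld("1002n0q11m17h017r1shexghfqf.com"))  # 1002n0q11m17h017r1shexghfqf
```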

Supervised Classification

Supervised learning features:
- Length of domain
- Entropy of domain, measuring the randomness of the name:
  $H(X) = \sum_{i=1}^{n} P(x_i) I(x_i) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i)$
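
A minimal Python sketch of these two features; the entropy function follows the formula above with base b = 2, which reproduces the values shown in the data frame on the next slide.

```python
import math
from collections import Counter

def shannon_entropy(s, base=2):
    """H(X) = -sum_i P(x_i) * log_b(P(x_i)) over the characters of s."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log(c / len(s), base) for c in counts.values())

domain = "theukwebdesigncompany"
print(len(domain), round(shannon_entropy(domain), 6))   # 21 4.070656
```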

Data Frame Sample

domain                                     class  length  entropy
theukwebdesigncompany                      legit  21      4.070656
texaswithlove1982-amomentlikethis          legit  33      4.051822
congresomundialjjrperu2009                 legit  26      4.056021
a17btkyb38gxe41pwd50nxmzjxiwjwdwfrp52      dga    37      4.540402
a17c49l68ntkqnuhvkrmyb28fubvn30e31g43dq    dga    39      4.631305
a17d60gtnxk47gskti15izhvlviyksh64nqkz      dga    37      4.270132
a17erpzfzh64c69csi35bqgvp52drita67jzmy     dga    38      4.629249
a17fro51oyk67b18ksfzoti55j36p32o11fvc29cr  dga    41      4.305859

Data Visualization - Plotting

Prepare for Classification
Re-sampling: shuffle the data randomly for training/testing (80/20 split).
Classification algorithms:
- Random Forest
- Support Vector Machines (SVM)
- Naïve Bayes
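
A sketch of this setup with scikit-learn; the talk does not list hyperparameters, so the defaults (and the choice of LinearSVC for the SVM) are assumptions, and domains.csv is a hypothetical file holding the features.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB

df = pd.read_csv("domains.csv")            # hypothetical file: domain, class, length, entropy
X = df[["length", "entropy"]].values
y = (df["class"] == "dga").astype(int)     # dga is the positive class

# Shuffle and split 80/20 for training/testing, as on the slide.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(),
    "SVM": LinearSVC(),
    "Naive Bayes": GaussianNB(),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```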

Random Forest

True \ Predicted    dga     legit    All
dga                 2991    6379     9370
legit               427     127532   127959
All                 3418    133911   137329

True Positive Rate (TPR) = 31.92%
True Negative Rate (TNR) = 99.67%
False Negative Rate (FNR) = 68.08%
False Positive Rate (FPR) = 0.33%
False Acceptance Rate (FAR) = 4.76%
False Rejection Rate (FRR) = 12.49%
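
For reference, a sketch of how these rates follow from a confusion matrix, reusing the models and test split from the previous snippet; the FAR/FRR formulas are inferred from the numbers reported on this slide.

```python
from sklearn.metrics import confusion_matrix

y_pred = models["Random Forest"].predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)   # True Positive Rate
tnr = tn / (tn + fp)   # True Negative Rate
fnr = fn / (tp + fn)   # False Negative Rate
fpr = fp / (tn + fp)   # False Positive Rate
# FAR/FRR as the slide appears to define them: for Random Forest,
# FAR = 6379/133911 = 4.76% and FRR = 427/3418 = 12.49%.
far = fn / (fn + tn)   # dga domains among all domains predicted legit
frr = fp / (fp + tp)   # legit domains among all domains predicted dga
print(f"TPR={tpr:.2%} TNR={tnr:.2%} FNR={fnr:.2%} "
      f"FPR={fpr:.2%} FAR={far:.2%} FRR={frr:.2%}")
```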

SVM

True \ Predicted    dga     legit    All
dga                 1160    8210     9370
legit               105     127854   127959
All                 1265    136064   137329

TPR = 12.38%
TNR = 99.92%
FNR = 87.62%
FPR = 0.08%
FAR = 6.03%
FRR = 8.30%

Naïve Bayes

True \ Predicted    dga     legit    All
dga                 3332    6038     9370
legit               5061    122898   127959
All                 8393    128936   137329

TPR = 35.56%
TNR = 96.04%
FNR = 64.44%
FPR = 3.96%
FAR = 4.68%
FRR = 60.30%

Result Comparisons

        Random Forest   SVM       Naïve Bayes
TPR     31.92%          12.38%    35.56%
TNR     99.67%          99.92%    96.04%
FNR     68.08%          87.62%    64.44%
FPR     0.33%           0.08%     3.96%
FAR     4.76%           6.03%     4.68%
FRR     12.49%          8.30%     60.30%

[Charts: accuracy rate and error rate for each algorithm]

We need to improve our model.

Improvement Insight
Many DGAs are dictionary-based algorithms, so how can we measure the similarity among domains?
Introduce new features based on N-grams:
- Build up text corpus matrices: a legit-domain matrix and a dictionary-words matrix
- Calculate a similarity score based on each matrix

Similarity Score Using N-Grams

N-grams of domains [N = 3, 4, 5]; for example, "google" yields the grams goo, oog, ogl, gle, goog, oogl, ogle, ...

To calculate the similarity score:
- Assume all legit domains form $D_{1\times m}$
- Build up the N-gram matrix of all the legit domains, $L_{m\times n}$
- Sum up the frequency of all grams, $S_{1\times n} = \{S_1, S_2, \ldots, S_n\}$
- Compute the gram-frequency array $F_{1\times n}$ for each domain to be scored
- Calculate the match value $M = S_{1\times n} \times [F_{1\times n}]^{T}$
- Normalize the match value to the similarity score, $\mathit{Score} = \log_{10} M$
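
A minimal sketch of this computation, assuming scikit-learn's CountVectorizer for the character N-gram matrix (the slides do not name a library) and a three-domain stand-in for the full Alexa corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def make_ngram_scorer(corpus):
    """Fit a 3-5 character-gram counter on `corpus` and return a scorer
    implementing Score = log10(M), M = S x F^T; the +1 guards log10(0)
    for domains sharing no grams with the corpus."""
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 5))
    L = vectorizer.fit_transform(corpus)       # L (m x n): gram counts per domain
    S = np.asarray(L.sum(axis=0)).ravel()      # S (1 x n): total gram frequencies

    def score(domain):
        F = vectorizer.transform([domain]).toarray().ravel()   # F (1 x n)
        return float(np.log10(float(S @ F) + 1.0))

    return score

alexa_score = make_ngram_scorer(["google", "facebook", "youtube"])  # stand-in corpus
print(alexa_score("googleplus"))      # relatively high: shares grams with the corpus
print(alexa_score("1002n0q11m17h"))   # near zero: no grams seen in legit names
```

The dict_grams feature on the following slides can be produced the same way by fitting the scorer on the dictionary word list instead of the Alexa corpus.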

Alexa Gram

We calculate the alexa_grams score for every single domain.

domain                 class  length  entropy  alexa_grams
investmentsonthebeach  legit  21      3.368    144.722
infiniteskills         legit  14      2.807    81.379
dticash                legit  7       2.807    26.558
healthyliving          legit  13      3.239    76.710
asset-cache            legit  11      2.732    46.268
wdqdreklqnpp           dga    12      3.085    11.242
wdqjkpltirjhtho        dga    15      3.507    14.304
wdqxavemaedon          dga    13      3.239    28.468
wdraokbcnspexm         dga    14      3.807    25.935
wdsqfivqnqcbna         dga    14      3.325    4.598

Dictionary Gram

Similarly, we calculate the dict_grams score for every single domain.
Dictionary: 479,623 commonly used English words.

domain                 class  length  entropy  dict_grams
investmentsonthebeach  legit  21      3.368    109.723
infiniteskills         legit  14      2.807    72.786
dticash                legit  7       2.807    23.710
healthyliving          legit  13      3.239    61.722
asset-cache            legit  11      2.732    31.691
wdqdreklqnpp           dga    12      3.085    6.367
wdqjkpltirjhtho        dga    15      3.507    16.554
wdqxavemaedon          dga    13      3.239    28.700
wdraokbcnspexm         dga    14      3.807    19.785
wdsqfivqnqcbna         dga    14      3.325    3.629

New Data Frame Sample

New features:
- Calculate N-gram score against the Alexa domains: alexa_grams
- Calculate N-gram score against the dictionary: dict_grams

domain                 class  length  entropy  alexa_grams  dict_grams
investmentsonthebeach  legit  21      3.368    144.722      109.723
infiniteskills         legit  14      2.807    81.379       72.786
dticash                legit  7       2.807    26.558       23.710
healthyliving          legit  13      3.239    76.710       61.722
asset-cache            legit  11      2.732    46.268       31.691
wdqdreklqnpp           dga    12      3.085    11.242       6.367
wdqjkpltirjhtho        dga    15      3.507    14.304       16.554
wdqxavemaedon          dga    13      3.239    28.468       28.700
wdraokbcnspexm         dga    14      3.807    25.935       19.785
wdsqfivqnqcbna         dga    14      3.325    4.598        3.629
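
The two columns can then be attached to the data frame; a sketch continuing the earlier snippets, where english_words is a hypothetical list holding the 479,623 dictionary terms:

```python
dict_score = make_ngram_scorer(english_words)        # english_words: hypothetical word list
df["alexa_grams"] = df["domain"].apply(alexa_score)  # score against the Alexa corpus
df["dict_grams"] = df["domain"].apply(dict_score)    # score against the dictionary
```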

More Plots

New Features in Classification
Keep the same classification parameters. Train the model with four features (sketched below):
- Length
- Entropy
- Alexa_grams
- Dict_grams
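
Continuing the earlier sketch, retraining only changes the feature matrix (assuming the data frame now carries the two new columns):

```python
# Same split and classifiers as before; only the feature matrix changes.
X = df[["length", "entropy", "alexa_grams", "dict_grams"]].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```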

Model Improvement Comparison

        Old model                              New model
Rate    Random Forest  SVM      Naïve Bayes   Random Forest  SVM      Naïve Bayes
TPR     31.92%         12.38%   35.56%        97.53%         92.03%   76.87%
TNR     99.67%         99.92%   96.04%        99.80%         99.58%   99.72%
FNR     68.08%         87.62%   64.44%        2.47%          7.97%    23.13%
FPR     0.33%          0.08%    3.96%         0.20%          0.42%    0.28%
FAR     4.76%          6.03%    4.68%         0.18%          0.58%    1.67%
FRR     12.49%         8.30%    60.30%        2.70%          5.83%

[Charts: accuracy rate and error rate for the old and new models]

Detailed Comparison – Random Forest
The new model is better.

Detailed Comparison – SVM
The new model is better.

Detailed Comparison – Naïve Bayes
The new model is better.

Conclusion
- Applied machine learning in the cybersecurity area
- Introduced two new features for DGA domain classification
- Successfully identified DGA domains
- Compared the performance of three algorithms on the new features
- Further research will focus on real-time monitoring

Q & A Thank You