In the search of resolvers

In the search of resolvers
Jing Qiao 乔婧, Sebastian Castro – NZRS
DNS-OARC 25, Dallas

Background
Domain Popularity Ranking
  Derive domain popularity by mining DNS data
Noisy nature of DNS data
  Certain source addresses represent resolvers, the rest a variety of behaviour
Can we pinpoint the resolvers?

Noisy DNS
Long tail of addresses sending a few queries on a given day

Data Collection
To identify resolvers, we need some data
Base curated data:
  836 known resolver addresses
    Local ISPs, Google DNS, OpenDNS
  276 known non-resolver addresses
    Monitoring addresses from ICANN, asking for www.zz--icann-sla-monitoring.nz
    Addresses sending only NS queries

Exploratory Analysis
Do all resolvers behave in a similar way?
http://blog.nzrs.net.nz/characterization-of-popular-resolvers-from-our-point-of-view-2/
Conclusions: there are some patterns
  Primary/secondary address
  Validating resolvers
  Resolvers in front of mail servers

Supervised classifier
Can we predict if a source address is a resolver?
14 features per day per address:
  Fraction of A, AAAA, MX, TXT, SPF, DS, DNSKEY, NS, SRV, SOA queries
  Fraction of NoError and NxDomain responses
  Fraction of CD and RD queries
Training data:
  Extract 1 day of DNS traffic (653,232 unique source addresses)
  Label it using the base curated data
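
A minimal sketch of how these per-address features could be assembled with pandas; the actual extraction pipeline is not shown in the slides, and the input column names (src, qtype, rcode, cd, rd) are assumptions.

import pandas as pd

QTYPES = ["A", "AAAA", "MX", "TXT", "SPF", "DS", "DNSKEY", "NS", "SRV", "SOA"]

def build_features(day_df):
    """Aggregate one day of query logs into the 14 per-address features.

    Expects one row per query with hypothetical columns:
    src (source address), qtype, rcode, cd (0/1), rd (0/1).
    """
    grouped = day_df.groupby("src")
    feats = pd.DataFrame(index=grouped.size().index)
    # Fraction of each query type sent by the address (10 features)
    for qt in QTYPES:
        feats[f"frac_{qt}"] = grouped.apply(lambda g: (g.qtype == qt).mean())
    # Fraction of NoError and NxDomain responses (2 features)
    feats["frac_noerror"] = grouped.apply(lambda g: (g.rcode == "NOERROR").mean())
    feats["frac_nxdomain"] = grouped.apply(lambda g: (g.rcode == "NXDOMAIN").mean())
    # Fraction of queries with the CD and RD flags set (2 features)
    feats["frac_cd"] = grouped.apply(lambda g: g.cd.mean())
    feats["frac_rd"] = grouped.apply(lambda g: g.rd.mean())
    return feats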

Training Model
LinearSVC
__________________________________________________________________________
Training:
LinearSVC(C=1.0, class_weight='balanced', dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
train time: 0.003s
Cross-validating:
Accuracy: 1.00 (+/- 0.00)
CV time: 0.056s
test time: 0.000s
accuracy: 1.000
dimensionality: 14
density: 1.000000
classification report:
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        73
          1       1.00      1.00      1.00       206
avg / total       1.00      1.00      1.00       279
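
The parameter dump above matches scikit-learn's LinearSVC defaults plus class_weight='balanced'. A hedged reconstruction of the training and cross-validation step is sketched below; the train/test split ratio, the 5-fold cross-validation and the known_resolvers / known_nonresolvers address sets are assumptions, not details taken from the slides.

from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report

# Keep only the curated addresses and label them: 1 = known resolver, 0 = known non-resolver
labelled = feats[feats.index.isin(known_resolvers | known_nonresolvers)]
y = labelled.index.isin(known_resolvers).astype(int)
X = labelled.values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # split ratio assumed

model = LinearSVC(class_weight="balanced")  # remaining parameters at their defaults, as in the dump
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))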

Other Learning Algorithms
For a classification problem with fewer than 100K samples, LinearSVC is the recommended first choice
We also benchmarked some other algorithms, such as K-Neighbors and Random Forest
All of them achieved 100% accuracy
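
A short benchmark sketch in the same vein; the K-Neighbors and Random Forest hyperparameters are not given on the slides, so the values below are placeholders, and X, y are the labelled feature matrix and labels from the previous sketch.

from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

candidates = {
    "LinearSVC": LinearSVC(class_weight="balanced"),
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),          # placeholder k
    "RandomForest": RandomForestClassifier(n_estimators=100),   # placeholder size
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.3f}")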

Test the model
Apply the model to three different days
Resolver is represented as 1, and non-resolver as 0

df = predict_result(model, "20160301")
df.isresolver_predict.value_counts()
1    645060
0      8172

df = predict_result(model, "20160429")
1    529757
0      6243

df = predict_result(model, "20151212")
1    453640
0      9279
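
predict_result() itself is not shown in the slides; a hypothetical reconstruction, reusing build_features() from above and an assumed load_day_traffic() loader, could look like this:

def predict_result(model, day):
    """Load one day of traffic, rebuild the 14 features per source address,
    and attach the model's prediction (1 = resolver, 0 = non-resolver)."""
    day_df = load_day_traffic(day)      # assumed helper that reads the day's capture
    feats = build_features(day_df)      # same 14 features used for training
    feats["isresolver_predict"] = model.predict(feats.values)
    return feats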

Preliminary Analysis
Most of the addresses are classified as resolvers
Possibly because the list of non-resolvers shows a very specific behaviour
The model fits that specific behaviour, leaving any other address labelled as a resolver
The next iteration of this work should improve the training data to include different patterns

Unsupervised classifier
What if we let a classifier learn the structure instead of imposing it?
The same 14 features, over 1 day's DNS traffic
Ignore clients that send fewer than 10 queries, to reduce the noise
Run the K-Means algorithm with K=6, inspired by Verisign work from 2013
Calculate the percentage of clients distributed across clusters
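
A minimal K-Means sketch under the settings stated above (K=6, clients with at least 10 queries); the query_count column and the feature scaling step are assumptions, not details from the slides.

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# feats: the 14-feature DataFrame, plus an assumed query_count column used only for filtering
active = feats[feats["query_count"] >= 10].drop(columns="query_count")

X = StandardScaler().fit_transform(active)        # scaling assumed, not stated on the slide
km = KMeans(n_clusters=6, random_state=0).fit(X)

active = active.assign(cluster=km.labels_)
# Percentage of clients distributed across the 6 clusters
print(active.cluster.value_counts(normalize=True).mul(100).round(1))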

K-Means Cost Curve

Query Type Profile per cluster

Rcode profile per cluster

Flag profile per cluster

Clustering accuracy
How many known resolvers fall in the same cluster? How many known non-resolvers?
Tested on both a week day and a weekend day: 98% ~ 99% of known resolvers fit in the same cluster
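
One way this figure could be computed, assuming the labelled address sets from the supervised experiment and the clustered DataFrame from the K-Means sketch are available:

def dominant_cluster_share(active, addresses):
    """Share of the labelled addresses that land in their most common cluster."""
    labels = active.loc[active.index.isin(addresses), "cluster"]
    return labels.value_counts(normalize=True).iloc[0]

print("known resolvers in one cluster:     %.1f%%" % (100 * dominant_cluster_share(active, known_resolvers)))
print("known non-resolvers in one cluster: %.1f%%" % (100 * dominant_cluster_share(active, known_nonresolvers)))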

Client persistence
Another differentiating factor could be client persistence
Within a 10-day rolling window, count the addresses seen on a specific number of days
Addresses sending traffic all the time will fit the known resolver and monitoring roles
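
A sketch of the rolling-window count, assuming the per-day traffic has already been reduced to a set of source addresses per day (a hypothetical input format):

import pandas as pd

def persistence_histogram(daily_addresses):
    """daily_addresses: dict mapping day string -> set of source addresses seen that day.
    For a 10-day window, return how many addresses were seen on exactly 1, 2, ..., 10 days."""
    days_seen = pd.Series(
        [addr for day in daily_addresses.values() for addr in day]
    ).value_counts()                      # number of days each address appears in the window
    return days_seen.value_counts().sort_index()

# Addresses present on all 10 days (histogram index 10) are expected to be
# dominated by known resolvers and monitoring addresses.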

Client Persistence

Resolvers persistence
Do the known resolver addresses fit the persistence hypothesis?
What if we check their presence at different levels?

Resolvers persistence

Future work
Identify unknown resolvers by checking membership of the “resolver like” cluster
Exchange information with other operators about known resolvers
Potential uses: curated list of addresses, white listing, others

Conclusions
This analysis can be repeated for other ccTLDs, using open source tools
The analysis code will be made available
Easily adaptable to use ENTRADA

jing@nzrs.net.nz, sebastian@nzrs.net.nz