Mining Anomalies Using Traffic Feature Distributions
Anukool Lakhina, Mark Crovella, Christophe Diot
ACM SIGCOMM 2005
Presented by: Sailesh Kumar
Sailesh Kumar - 12/13/2015

Overview
- Introduction to PCA
- Application of PCA to OD flows (Lakhina et al., SIGMETRICS'04)
- Volume anomalies
- Subspace analysis of link traffic (Lakhina et al., SIGCOMM'04)
- Feature-based anomaly detection
- Anomaly classification (Lakhina et al., SIGCOMM'05)
Principal Component Analysis
- PCA is a useful statistical technique that has found application in fields such as face recognition and image compression.
- It is a technique for finding patterns in high-dimensional data.
- Two important terms:
  - Covariance matrix: just as variance describes the spread of data in a single dimension, covariance describes the relationship between different dimensions of multi-dimensional data.
  - Eigenvector/eigenvalue: an n x n symmetric matrix (such as a covariance matrix) has n eigenvectors, and all of them are orthogonal to each other.
Principal Component Analysis
- PCA is a way of identifying patterns in data and expressing the data so as to highlight their similarities and differences.
- Since patterns can be hard to find in high-dimensional data, where graphical representation is not available, PCA is a powerful analysis tool.
- It essentially reduces the dimensionality when data along multiple dimensions are correlated.
- Simple examples of 2-D correlated data:
  - (Number of hours studied, marks obtained in an exam)
  - (Traffic entering a network, traffic exiting a network)
  - Such data sets may exhibit strong correlation (positive correlation in these instances).
  - Thus, it might be worthwhile to describe such a data set along a single dimension.
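As a concrete illustration of the (hours studied, marks obtained) example above, the sketch below applies PCA to synthetic 2-D correlated data. The data, seed, and numbers are assumptions for illustration, not from the paper.

```python
import numpy as np

# Hypothetical 2-D correlated data: (hours studied, exam marks).
# Marks roughly track hours plus noise, so the dimensions correlate.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=100)
marks = 8 * hours + rng.normal(0, 4, size=100)
data = np.column_stack([hours, marks])

# Center the data, compute the covariance matrix, and find its
# eigenvectors (the principal axes).
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)      # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order

# Fraction of total variance captured by the first principal component.
explained = eigvals[-1] / eigvals.sum()
```

Because the two dimensions are strongly correlated, almost all of the variance lies along a single principal axis, which is exactly why one dimension suffices to describe the data.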
2-D Data and Principal Components
[Figure: scatter plot of 2-D data with its principal component axes]
One-Dimensional Projection
[Figure: the data projected onto the first principal component]
Principal Component Analysis
- Consider p-dimensional data.
- If the data along the p dimensions are correlated (high positive or negative covariance), then they can be represented with fewer dimensions (k).
- In general, any p-dimensional data set can be mapped onto its first k principal axes (the k principal components with the highest eigenvalues).
- The data mapped onto these k dimensions are usually called the normal component.
- The remaining data are called the residual component.
OD Flows
- An OD flow is the traffic that enters at an origin PoP and exits at a destination PoP of a backbone network.
- The relationship between link traffic and OD flow traffic is captured by the routing matrix A:
  - A has size (# links) x (# OD flows).
  - A_ij = 1 if OD flow j traverses link i.
  - Traffic engineering is essentially adjusting the matrix A.
- A network with n PoPs has n^2 OD flows.
  - Thus OD flows are high-dimensional data: 20 PoPs result in 400 dimensions.
- However, quite intuitively, OD flows are correlated.
  - Hence they can be represented with far fewer dimensions, as Lakhina et al. (SIGMETRICS'04) show.
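The link/OD-flow relationship can be sketched with a toy routing matrix; the topology and flow volumes below are hypothetical, not from the paper.

```python
import numpy as np

# Toy topology: 2 links, 3 OD flows.
# A[i, j] = 1 if OD flow j traverses link i.
A = np.array([
    [1, 1, 0],   # link 0 carries OD flows 0 and 1
    [0, 1, 1],   # link 1 carries OD flows 1 and 2
])

# Hypothetical OD flow volumes (e.g., bytes per time slot).
x = np.array([10.0, 5.0, 20.0])

# Observed link traffic is the routing matrix applied to the OD flows.
y = A @ x
```

Note that several distinct OD flow vectors x can produce the same link vector y, which is why link measurements alone do not directly reveal which OD flow changed.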
OD Flows
- Lakhina et al. (SIGMETRICS'04) show that only 5-10 dimensions are sufficient to capture 95+% of the variance in OD flow traffic.
Why Care About OD Flows
- A volume anomaly typically arises on an OD flow (traffic arriving at one PoP and destined for another PoP).
- If we only monitor traffic on network links, the volume arising from an OD flow may not be noticeable on any single link.
- Thus, a naive approach won't work if OD flow information isn't available.
Subspace Analysis of Link Traffic
- Even if OD flow information is not available and only link traffic is measured, PCA and the subspace technique can detect volume anomalies.
- What is the data?
  - Time samples of traffic volumes on all m links in the network.
  - Y is the t x m traffic measurement matrix; an arbitrary row y of Y denotes one sample.
- Use PCA to separate normal and anomalous traffic:
  - Construct the principal axes and map the data onto them.
  - Consider the set of first k axes, which capture the highest variance.
  - The projection of y onto these k axes is called the normal traffic; the remainder is the residual traffic.
Subspace Analysis of Link Traffic
- An approach to separate normal traffic from anomalous traffic.
- Normal subspace: the space spanned by the first k principal components.
- Anomalous subspace: the space spanned by the remaining principal components.
- Then decompose the traffic on all links by projecting onto the two subspaces to obtain y = y_normal + y_residual, where y is the traffic vector of all links at a particular point in time, y_normal is the normal traffic vector, and y_residual is the residual traffic vector.
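The decomposition above can be sketched with PCA via the SVD. The measurement matrix below is synthetic, and the choice k = 1 is an assumption suited to this toy data, not a recommendation from the paper.

```python
import numpy as np

def subspace_split(Y, k):
    """Split each row of the t x m measurement matrix Y into its
    projection onto the first k principal axes (normal part) and
    the remainder (residual part)."""
    centered = Y - Y.mean(axis=0)
    # Rows of Vt are principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    P = Vt[:k].T                  # m x k matrix of top-k principal axes
    normal = centered @ (P @ P.T) # projection onto the normal subspace
    residual = centered - normal
    return normal, residual

# Hypothetical link measurements: 200 time samples on 4 links whose
# traffic is driven by one shared diurnal pattern plus small noise.
rng = np.random.default_rng(1)
pattern = np.sin(np.linspace(0, 8 * np.pi, 200))
Y = np.outer(pattern, [3.0, 1.0, 2.0, 4.0]) + rng.normal(0, 0.05, (200, 4))

normal, residual = subspace_split(Y, k=1)
```

Since all four links follow one shared pattern, a single principal axis absorbs nearly all the energy and the residual is small, which is the situation the subspace method exploits.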
Subspace Analysis Results
- Note that during an anomaly, the normal component doesn't change much, while the residual component changes quite a lot.
- Thus, anomalies can be detected by setting a threshold on the residual component.
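A minimal detection sketch on the residual component follows. The paper sets the threshold statistically (via the Q-statistic on the squared prediction error); this illustration instead uses a fixed cutoff and an injected spike, both of which are assumptions for demonstration only.

```python
import numpy as np

def detect(residual, threshold):
    """Flag time steps whose squared residual norm exceeds threshold."""
    spe = np.sum(residual ** 2, axis=1)   # squared prediction error per time step
    return np.flatnonzero(spe > threshold)

# Hypothetical residual traffic: small noise everywhere, with a
# large anomaly injected at time step 50.
rng = np.random.default_rng(2)
residual = rng.normal(0, 0.1, size=(100, 4))
residual[50] += 5.0

alarms = detect(residual, threshold=1.0)
```

In practice the threshold must be chosen against the noise level of the residual subspace, which is exactly what the Q-statistic formalizes.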
Background Over
- Let's talk about today's paper now!
- The objective is to build an anomaly diagnosis system that:
  - detects a diverse range of anomalies,
  - distinguishes between different types of anomalies,
  - and groups similar anomalies.
- Clearly these goals are ambitious:
  - Anomalies are a moving target (malicious anomalies adapt).
  - New anomalies will continue to arise.
  - In general, this is a difficult problem.
- This paper takes significant steps toward a system that fulfills these criteria.
Typical Characteristics of Anomalies
- Most anomalies induce a change in the distribution of packet header fields (called features):
  - DoS attack: many source IP addresses concentrated on a single destination IP address.
  - Network scan: dispersed distribution of destination addresses.
  - Most worms/viruses also induce some change in the distribution of certain features.
  - However, these changes can be very subtle, and mining them is like searching for needles in a haystack.
- Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.
Limitations of Volume-Based Detection
- Port scan anomaly: traffic features change, but traffic volume remains more or less the same.
- We can use entropy to capture variations in a traffic feature:
  - Entropy takes value 0 when the distribution is maximally concentrated.
  - Entropy takes value log2 N when the distribution is maximally dispersed over N values.
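The entropy computation can be sketched directly from the empirical distribution of a feature in one time slot; the port values below are hypothetical.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Entropy (in bits) of the empirical distribution of values,
    e.g. the destination ports seen in one time slot."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Maximally concentrated: every packet hits the same port -> entropy 0.
concentrated = sample_entropy([80] * 1000)

# Maximally dispersed: each of N = 1024 ports seen once
# -> entropy log2(1024) = 10 bits.
dispersed = sample_entropy(range(1024))
```

A port scan leaves the packet count roughly unchanged but pushes the destination-port entropy toward its dispersed extreme, which is the signal volume metrics miss.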
Effectiveness of Feature Entropy
- The port scan is dwarfed in volume metrics...
- ...but stands out in feature entropy, which also reveals its structure.
Entropy-Based Scheme
- In the volume-based scheme, the number of packets or bytes per time slot was the variable.
- In the entropy-based scheme, the entropy of every traffic feature in every time slot is the variable.
- This gives a three-way data matrix H:
  - H(t, p, k) denotes the entropy at time t of OD flow p for traffic feature k.
- To apply the subspace method, we need to unfold H into a single-way representation.
Multi-Way to Single-Way
- Decompose H into a single-way matrix.
- Now apply the usual subspace decomposition.
- Every row h of the matrix is decomposed into h = h_normal + h_residual, where h is the entropy vector at a particular point in time, h_normal is the normal component, and h_residual is the residual component.
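The unfolding step can be sketched with a reshape; the dimensions and random contents below are arbitrary assumptions for illustration.

```python
import numpy as np

# Hypothetical dimensions: t time slots, p OD flows, f traffic features
# (e.g. srcIP, dstIP, srcPort, dstPort -> f = 4).
t, p, f = 6, 5, 4
rng = np.random.default_rng(3)
H3 = rng.random((t, p, f))   # H3[t, p, k]: entropy of feature k of OD flow p at time t

# Unfold the three-way array into a t x (p * f) single-way matrix:
# each row stacks the entropies of all OD flows and all features
# for one time slot, so the subspace method applies unchanged.
H = H3.reshape(t, p * f)
```

Row t of H is then treated exactly like a row of the link-traffic matrix Y in the earlier subspace analysis.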
Benefits of Using Features + OD Flows
- Applying entropy to both traffic features and the ensemble of OD flows has a key benefit: anomalies correlated across both OD flows and features stand out.
- Moreover, as the first example showed, entropy is an effective summarization tool when traffic volume changes are not significant.
- We now evaluate how this scheme improves over volume-based schemes.
Entropy-Based versus Volume-Based
[Figure: comparison of entropy-based and volume-based detection]
Detection Rates
[Figure: detection rate results]
Anomaly Classification
- Cluster anomalies that are close enough in entropy space.
- Each anomaly can be thought of as a point in 4-D space with coordinate vector h = [H(srcIP), H(dstIP), H(srcPort), H(dstPort)].
- Do anomalies of similar type appear next to each other in the entropy space? Yes!
- Before digging into the details, a brief introduction to clustering algorithms.
Clustering Algorithms
- The objective is to cluster data points that are close together.
- Two general approaches: k-means and hierarchical clustering.
- K-means is one of the simplest unsupervised learning algorithms:
  1. Place K points into the space represented by the data being clustered. These points form the initial group centroids.
  2. Assign each object to the group with the closest centroid.
  3. When all objects have been assigned, recalculate the positions of the K centroids.
  4. Repeat steps 2 and 3 until the centroids no longer move.
- Hierarchical approaches either begin with one cluster and split it into multiple clusters, or begin with n clusters and merge clusters together.
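The four k-means steps above can be sketched as follows; the 4-D "entropy space" blobs are synthetic stand-ins for real anomaly coordinates, and the seed and cluster centers are arbitrary.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K initial centroids from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points.
        new_centroids = np.array(
            [points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated hypothetical blobs in 4-D entropy space.
rng = np.random.default_rng(4)
blob_a = rng.normal([0, 0, 0, 0], 0.1, size=(20, 4))
blob_b = rng.normal([3, 3, 3, 3], 0.1, size=(20, 4))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
```

For well-separated groups like these, k-means recovers the two blobs as the two clusters; with overlapping groups the result depends on initialization, which is one reason the paper cross-checks against hierarchical clustering.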
Hierarchical Clustering Example (Nearest Neighbor)
[Figure sequence: successive nearest-neighbor merges]
- Level 2: k = 7 clusters
- Level 3: k = 6 clusters
- Level 4: k = 5 clusters
- Level 5: k = 4 clusters
- Level 6: k = 3 clusters
- Level 7: k = 2 clusters
- Level 8: k = 1 cluster
Clustering Anomalies
- Summary: correctly classified 292 of 296 injected anomalies.
[Figure: known labels vs. cluster results in (H(DstIP), H(SrcIP)) entropy space]
- Legend: Code Red, scanning, single-source DoS attack, multi-source DoS attack.
Clustering Anomalies
[Figure: clusters in entropy space; axes include H(SrcIP), H(SrcPort), H(DstIP)]
- Heuristics identify about 10 clusters in the dataset.
- The results of both clustering algorithms are consistent.
Questions?