
1 Evolving Insider Threat Detection
Pallabi Parveen
Dr. Bhavani Thuraisingham (Advisor)
Dept. of Computer Science, University of Texas at Dallas
Funded by AFOSR

2 Outline: Evolving Insider Threat Detection — Unsupervised Learning
Our goal is to detect evolving insider threats. There are two broad ways to do this: supervised learning and unsupervised learning. I will focus on unsupervised learning first and then turn to supervised learning.

3 Evolving Insider Threat Detection
[Pipeline diagram] Gather system logs/traces from week i → feature extraction and selection → learning algorithm (supervised: one-class SVM, OCSVM; unsupervised: graph-based anomaly detection, GBAD) → ensemble-based stream mining maintains and updates an ensemble of models (online learning) → the ensemble is tested on data from week i+1 to answer "Anomaly?".
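A compact sketch of this week-by-week loop may help. Here extract_features, build_model, and test_chunk are hypothetical placeholders (assumptions) for the feature-selection, learning (OCSVM or GBAD), and prediction stages named in the diagram, and dropping the oldest model is a simplification of the accuracy-based replacement described on later slides.

def stream_pipeline(weekly_logs, k, extract_features, build_model, test_chunk):
    """Gather week i, learn from it, then test the ensemble on week i+1 (sketch)."""
    ensemble, flagged = [], []
    for week, raw_log in enumerate(weekly_logs):
        chunk = extract_features(raw_log)           # feature extraction & selection
        if ensemble:                                # testing on data from week i+1
            flagged.append((week, test_chunk(ensemble, chunk)))
        ensemble.append(build_model(chunk))         # online learning on week i
        if len(ensemble) > k:                       # ensemble-based stream mining
            ensemble.pop(0)                         # simplified: drop the oldest model
    return flagged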

4 Insider Threat Detection using unsupervised Learning based on Graph
In the first part I will discuss insider threat detection using unsupervised, graph-based learning.

5 Outline: Unsupervised Learning
Insider Threat, Related Work, Proposed Method, Experiments & Results. In the context of unsupervised learning, I will first explain what an insider threat is and how it can be detected, then review existing work on insider threat detection, then present our proposed method, and finally discuss our experiments and their outcome.

6 Definition of an Insider
An insider is someone who exploits, or intends to exploit, their legitimate access to assets for unauthorised purposes. Attacks by people with legitimate access to an organization's computers and networks represent a growing problem in our digital world. Insiders are not just employees today: they can include contractors, business partners, auditors, even an alumnus with a valid address. The term can also apply to an outside person who poses as an employee or officer by obtaining false credentials; in that sense an insider threat can behave much like a malicious hacker (also called a cracker or a black hat).

7 Insider Threat is a real threat
Computer Crime and Security Survey 2001: $377 million in financial losses due to attacks; 49% reported incidents of unauthorized network access by insiders. The 2001 Computer Crime and Security Survey found financial losses of $377 million due to attacks, and 49% of the reported incidents involved unauthorized network access by insiders.

8 Insider Threat : Continue
Detection / Prevention. Detection-based approach: unsupervised learning (graph-based anomaly detection) with ensemble-based stream mining. There are two ways to handle insider threats: prevent them or detect them. Our proposed work considers only detection. We use unsupervised learning, specifically graph-based anomaly detection, combined with ensemble-based stream mining; the next couple of slides explain why we chose unsupervised learning and ensemble-based stream mining.

9 Related work
"Intrusion Detection Using Sequences of System Calls" — supervised learning by Hofmeyr et al. "Mining for Structural Anomalies in Graph-Based Data Representations (GBAD) for Insider Threat Detection" — unsupervised learning by Eberle and Holder. All are static in nature and cannot learn from an evolving data stream. Some related work on insider threat detection exists: supervised learning over sequences of system calls by Hofmeyr et al., and unsupervised graph-based anomaly detection (GBAD) by Eberle and Holder. However, these experiments were done on static data and the methods cannot learn from an evolving data stream, whereas insider threat data is dynamic in nature.

10 Related Approaches and comparison with proposed solutions
Technique (proposed by)        Learning        Concept drift   Insider threat   Graph-based
Forrest, Hofmeyr               Supervised      No              Yes              No
Masud, Fan (stream mining)     Supervised      Yes             N/A              No
Liu                            Unsupervised    No              Yes              No
Holder (GBAD)                  Unsupervised    No              Yes              Yes
Our Approach (EIT)             Unsupervised    Yes             Yes              Yes

In this table we compare related approaches with our proposed solution along four challenge areas: the learning technique (supervised or unsupervised), whether concept drift is handled, whether insider threat is addressed, and whether the data is represented as a graph. Forrest and Hofmeyr address insider threat with supervised learning but do not handle concept drift and are not graph-based. Masud and Fan handle concept drift with supervised learning but do not address insider threat and are not graph-based. Liu and Holder both address insider threat with unsupervised learning but do not handle concept drift. Our approach addresses insider threat as well as concept drift, using unsupervised learning over a graph-based data representation.

11 Why Unsupervised Learning?
One approach to detecting insider threats is supervised learning, where models are built from labeled training data. In our dataset, approximately 0.03% of the training data is associated with insider threats (minority class), while 99.97% is associated with non-threat behavior (majority class). Insider threat activity is also hard to recognize: malicious changes are very close to normal data, and infected data is rare and dynamic in nature. With so little ground truth available in advance, supervised learning is a poor fit, so we choose unsupervised learning as an alternative.

12 Why Stream Mining
All prior approaches are static in nature and cannot learn from an evolving data stream.
[Figure: a data stream divided into chunks, showing the previous and current decision boundaries, normal data, anomaly data, and instances that fall victim to concept drift.]
In this figure the continuous data stream is divided into chunks: the first chunk is week 1 data, the second is week 2 data, and so on. The figure shows how the classifier's decision boundary changes, or evolves, from one chunk to the next, and two ways data points can be misclassified because of concept drift. The solid line is the decision boundary of the current chunk and the dotted line is the boundary of the previous chunk; white dots are normal data (true negatives), orange dots are anomalies (true positives), and striped dots are instances that fall victim to concept drift. First, the decision boundary of the second chunk moves upwards compared to that of the first chunk; if we used the first chunk's boundary rather than the second chunk's own, more normal data would be classified as anomalous, so false positives (FP) would rise. Second, the decision boundary of the third chunk moves downwards compared to the first chunk's, so more anomalies would be classified as normal and false negatives (FN) would rise. This is why stream mining is needed; it also suggests that a model built from a single chunk is not sufficient, so for further robustness we use an ensemble-based model.

13 Proposed Method
Graph-based anomaly detection (GBAD, unsupervised learning) [2] + ensemble-based stream mining. Our proposed method combines GBAD, an unsupervised learning technique, with ensemble-based stream mining; the next slides describe GBAD.

14 GBAD Approach
Determine the normative pattern S using SUBDUE's minimum description length (MDL) heuristic, which minimizes M(S,G) = DL(G|S) + DL(S). In GBAD we look for a normative pattern S within a graph G such that replacing each occurrence of S with a single node gives the minimum description length, as in the equation. The minimum description length M is the minimum number of bits needed to represent the graph after it has been compressed by the normative substructure S.

15 Unsupervised Pattern Discovery
Graph compression and the minimum description length (MDL) principle: the best graphical pattern S minimizes DL(S), the description length of S, plus DL(G|S), the description length of the graph G compressed with pattern S, where the description length is the minimum number of bits needed to represent the structure (this is SUBDUE's heuristic). Compression can also be based on inexact matches to the pattern. In the example, each of the repeated substructures can be replaced with a single node S1 because they share the same pattern; after replacing with S1, further repeated structure can be replaced with S2. Each such candidate is a possible normative pattern, and the one that minimizes the sum of the two terms is taken as the normative pattern. Here G is the entire graph, S is the substructure being analyzed, the first term DL(S) is the description length of the substructure, and the second term DL(G|S) is the description length of G after being compressed by S.
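As a rough illustration of this scoring, the sketch below (an assumption on my part, not the SUBDUE implementation) uses a crude bits-per-vertex/edge description length and networkx subgraph isomorphism to compress a graph by a candidate pattern and evaluate DL(S) + DL(G|S); the candidate with the lowest score would be taken as the normative pattern.

import math
import networkx as nx
from networkx.algorithms import isomorphism

def description_length(g: nx.Graph) -> float:
    """Crude DL (assumption): bits to list the vertices plus bits to list the edges."""
    n, m = g.number_of_nodes(), g.number_of_edges()
    return n * math.log2(n + 1) + m * 2 * math.log2(n + 1) if n else 0.0

def compress(g: nx.Graph, s: nx.Graph) -> nx.Graph:
    """Replace disjoint occurrences of pattern s in g with a single super-node."""
    g = g.copy()
    matcher = isomorphism.GraphMatcher(
        g, s, node_match=lambda a, b: a.get("label") == b.get("label"))
    used, occurrences = set(), []
    for mapping in matcher.subgraph_isomorphisms_iter():   # keys are nodes of g
        nodes = set(mapping.keys())
        if nodes & used:                 # keep the chosen occurrences disjoint
            continue
        used |= nodes
        occurrences.append(nodes)
    for i, nodes in enumerate(occurrences):
        super_node = f"S_{i}"            # assumes this name does not clash with g
        g.add_node(super_node, label="S")
        for u in nodes:                  # rewire edges that leave the occurrence
            for v in list(g.neighbors(u)):
                if v not in nodes:
                    g.add_edge(super_node, v)
            g.remove_node(u)
    return g

def mdl_score(g: nx.Graph, s: nx.Graph) -> float:
    """M(S,G) = DL(S) + DL(G|S), with DL(G|S) approximated by DL of the compressed graph."""
    return description_length(s) + description_length(compress(g, s))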

16 Three types of anomalies
Three algorithms handle the three anomaly categories using graph compression and the minimum description length (MDL) principle: GBAD-MDL finds anomalous modifications, GBAD-P (Probability) finds anomalous insertions, and GBAD-MPS (Maximum Partial Substructure) finds anomalous deletions. Once a normative pattern has been identified, any deviation from it is treated as an anomaly; GBAD-MDL detects modifications that are not normal, GBAD-P detects anomalous insertions, and GBAD-MPS detects suspicious deletions.

17 Example of graph with normative pattern and different types of anomalies
[Figure: several copies of the normative structure A–B–C–D; a G node inserted into every copy except the fourth (GBAD-P, insertion); C replaced by E in the third copy (GBAD-MDL, modification); the edge between B and D removed in the last copy (GBAD-MPS, deletion).]
Here is an example of the three methods. A–B–C–D is the normative pattern in this graph; every instance follows the same pattern. In the third graph C is modified to E, which is detected by GBAD-MDL, the variant that finds modifications of the normative pattern. A node G is inserted into every graph except the fourth; this anomalous insertion is detected by GBAD-P. In the last graph the link between B and D is deleted from the normative pattern; this is detected by GBAD-MPS. I will not go into the details of these three algorithms; interested readers can consult Eberle and Holder's GBAD paper in the references.
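To make the example concrete, here is a toy sketch (my illustration, not GBAD itself) that builds the normative chain A–B–C–D with networkx and uses graph edit distance as a stand-in for how far a structure deviates from the normative pattern; a modification or a deletion both yield a nonzero cost.

import networkx as nx

def chain(labels):
    """Build a simple path graph whose nodes carry the given labels."""
    g = nx.Graph()
    for i, lbl in enumerate(labels):
        g.add_node(i, label=lbl)
        if i:
            g.add_edge(i - 1, i)
    return g

normative = chain("ABCD")
modified = chain("ABED")   # C replaced by E, a GBAD-MDL style modification
deleted = chain("ABD")     # a vertex and its edges missing, a GBAD-MPS style deletion

same_label = lambda a, b: a["label"] == b["label"]
for name, g in (("modified", modified), ("deleted", deleted)):
    cost = nx.graph_edit_distance(normative, g, node_match=same_label)
    print(name, "deviation cost from the A-B-C-D pattern:", cost)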

18 Proposed Method
Graph-based anomaly detection (GBAD, unsupervised learning) + ensemble-based stream mining. In the next slides I will talk about ensemble-based stream mining.

19 Characteristics of Data Stream
Continuous flow of data. Examples: network traffic, sensor data, call center records. Data streams are continuous flows of data in which new data arrives all the time, for example network traffic, sensor data, and call center records.

20 Data Stream Classification
Single-model incremental classification vs. ensemble-model-based classification; the ensemble approach is more effective than the incremental approach. There are two ways to classify a data stream. One is to maintain a single model and update it each time with the new data. The other is to keep an ensemble of a fixed number of models (K models) and, whenever a new model is built, drop the least accurate model so that K models are always kept. Because of its effectiveness, our approach uses ensemble-based classification.

21 Ensemble of Classifiers
[Figure: an ensemble of classifiers C1, C2, C3; a test input x is given to each, the individual outputs are combined by voting, and the ensemble output is returned.]
An ensemble is a set of models (classifiers). For a test input x, each model participates in the voting process. Here models C1 and C2 vote positive while C3 votes negative; since the majority of the classifiers vote positive, the ensemble output for the test input is positive.

22 Proposed Ensemble based Insider Threat Detection (EIT)
Maintain K GBAD models, each with q normative patterns; combine them by majority voting; update the ensemble by always keeping K models and dropping the least accurate one. We always maintain K models, each holding q normative patterns, and combine their predictions by majority voting. When a new model arrives we keep it and drop the least accurate model, so the ensemble always contains K models. The next slide explains how the ensemble is updated and how anomalous data is detected.

23 Ensemble based Classification of Data Streams (unsupervised Learning--GBAD)
Build a model (with q normative patterns) from each data chunk and keep the best K such models as the ensemble (example: K = 3).
[Figure: data chunks D1–D5 produce models C1–C5; the ensemble {C1, C2, C5} predicts on the testing chunk D6, then the ensemble is updated.]
This slide shows ensemble-based classification of data streams with GBAD (unsupervised learning). Since learning is unsupervised, there is no ground truth for a new chunk; we rely on majority voting to assign labels to it. Each week of data is one chunk, and from each chunk we build a model C containing at most q normative patterns. When a new chunk (the test data), say D6, arrives, each model in the ensemble, for example C1, C2 and C5, makes a prediction using its q normative patterns: anomaly or normal. The final label of each data point in D6 is the majority vote. Once D6 is labeled, the ensemble must be updated so that it still contains K models: all models, including the one built from D6, participate in predicting the test data, and the model with the least agreement with the majority-vote decision is dropped from the ensemble. In this way we always keep K models.

24 EIT-U pseudocode: Ensemble(Ensemble A, testGraph t, Chunk S)
LABEL/TEST THE NEW MODEL
1: compute a new model with q normative substructures from S using GBAD
2: add the new model to A
3: for each model M in A
4:   for each class / normative substructure q in M
5:     Results1 ← run GBAD-P with test graph t and q
6:     Results2 ← run GBAD-MDL with test graph t and q
7:     Results3 ← run GBAD-MPS with test graph t and q
8:     Anomalies ← ParseResults(Results1, Results2, Results3)
   end for
9: for each anomaly N in Anomalies
10:   if greater than half of the models agree
11:     AgreedAnomalies ← N
12:     add 1 to the incorrect count of the disagreeing models
13:     add 1 to the correct count of the agreeing models
UPDATE THE ENSEMBLE
14: remove the model with the lowest correct/(correct + incorrect) ratio
end Ensemble

This slide shows the basic building blocks of the EIT-U algorithm. We have an ensemble A containing K models, a test graph t, and a new chunk S. At lines 1-2 we build a model with q normative substructures from chunk S using GBAD and add it to the ensemble. At lines 3-8 each model in the ensemble calls the three GBAD variants (GBAD-P, GBAD-MDL and GBAD-MPS) to predict whether structures in the test graph t are anomalous or normal. At lines 9-13 the test graph is labeled according to the majority vote, and the correct/incorrect counts of the agreeing and disagreeing models are updated. Finally the ensemble is updated: the model with the lowest accuracy is dropped, as shown in line 14.
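A condensed Python rendering of the pseudocode above may be useful; run_gbad and build_model are hypothetical stand-ins (assumptions) for invoking the GBAD-P/MDL/MPS variants and for building a model of q normative substructures from a chunk.

from dataclasses import dataclass

@dataclass
class Model:
    patterns: list              # the q normative substructures of this model
    correct: int = 0
    incorrect: int = 0
    def accuracy(self):
        total = self.correct + self.incorrect
        return self.correct / total if total else 0.0

def eit_u_step(ensemble, test_graph, chunk, build_model, run_gbad, k):
    ensemble = ensemble + [build_model(chunk)]        # lines 1-2: add the new model
    votes = {}                                        # anomaly -> ids of reporting models
    for m in ensemble:                                # lines 3-8: every model predicts
        reported = set()
        for q in m.patterns:
            for variant in ("P", "MDL", "MPS"):
                reported |= set(run_gbad(variant, test_graph, q))
        for a in reported:
            votes.setdefault(a, set()).add(id(m))
    agreed = set()
    for a, voters in votes.items():                   # lines 9-13: majority voting
        if len(voters) > len(ensemble) / 2:
            agreed.add(a)
            for m in ensemble:
                if id(m) in voters:
                    m.correct += 1
                else:
                    m.incorrect += 1
    if len(ensemble) > k:                             # line 14: drop the weakest model
        worst = min(range(len(ensemble)), key=lambda i: ensemble[i].accuracy())
        ensemble.pop(worst)
    return ensemble, agreed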

25 Majority Voting Vs. Weighted Majority Voting
To give more importance to the most recent chunk, we apply a fading factor: upon arrival of each new chunk, the weight of every previous model is reduced by a factor λ ∈ (0, 1), so for each model Mi its weight decreases as the corresponding chunk ages. In the majority-voting process an old chunk therefore carries less weight than the most recent chunk.

26 Ensemble of Classifiers using Fading Factor λ
When test data arrives, each model in the ensemble makes a prediction and participates in the majority vote, but its voting weight is scaled down by the fading factor λ. If the most recent chunk is the 7th, the most recent model M7 has exponent 7 - 7 = 0 on λ, model M3 has exponent 7 - 3 = 4, and the first model M1 has exponent 7 - 1 = 6. In this way the most recent chunks carry more weight than older chunks in the majority-voting process.

27 Fading Factor λ and Weighted Agreement
Following [3], the weighted agreement of an anomaly a reported by the ensemble E is

WA(a) = ( Σ over Mi in E with a ∈ A_Mi of λ^(L-i) ) / ( Σ over Mi in E of λ^(L-i) )

where A_Mi is the set of anomalies reported by model Mi, λ is a constant fading factor, L is the index of the most recent chunk, and model Mi built from chunk i receives the weight λ^(L-i). The numerator sums the weights of the models that report the anomaly, while the denominator sums the weights of all models (those voting anomaly as well as those voting normal), giving a normalized agreement value.
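A minimal sketch of this computation, assuming the reports are kept as a map from chunk index to the anomalies that chunk's model flagged; model Mi gets weight λ^(L-i), so older models fade.

def weighted_agreement(anomaly, reports, L, lam=0.9):
    """reports: dict chunk_index -> set of anomalies reported by that chunk's model."""
    num = sum(lam ** (L - i) for i, rep in reports.items() if anomaly in rep)
    den = sum(lam ** (L - i) for i in reports)
    return num / den if den else 0.0

# Example: models built from chunks 5, 6 and 7; the most recent chunk is L = 7.
reports = {5: {"a1"}, 6: {"a1", "a2"}, 7: {"a2"}}
for a in ("a1", "a2"):
    print(a, round(weighted_agreement(a, reports, L=7), 3))   # a1 -> 0.631, a2 -> 0.701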

28 EIT pseudocode using fading factor λ

29 Experiments 1998 MIT Lincoln Laboratory 500,000+ vertices
K = 1, 3, 5, 7, 9 models; q = 5 normative substructures per model/chunk; 9 weeks; each chunk covers 1 week. We tested our EIT algorithm on the 1998 Lincoln Laboratory intrusion detection dataset, which contains daily logs of all system calls made over a 9-week period and consists of more than 500,000 vertices. Each week of data is one chunk/model. We considered ensemble sizes of 1, 3, 5, 7 and 9, and each model holds at most q = 5 normative substructures.

30 A Sample system call record from MIT Lincoln Dataset
header,150,2, execve(2),,Fri Jul 31 07:46: , + msec
path,/usr/lib/fs/ufs/quota
attribute,104555,root,bin, ,187986,0
exec_args,1, /usr/sbin/quota
subject,2110,root,rjm,2110,rjm,280,272,
return,success,0
trailer,150

This shows the data token extracted for every user system call found. Each token begins with a header line and ends with a trailer line and includes the path, date, call, arguments, process ID, terminal, user ID, and return value. The header line consists of "header", the size of the token in bytes, the audit record version number, the system call, and the time and date of the call. The path line, if present, gives the full system path of execution. The attribute line, if present, gives the file access mode and type, owner user ID, owner group ID, file system ID, node ID, and device ID. The exec_args line, if present, gives the number of arguments, which appear on the following line. The subject line gives the audit ID, effective user ID, effective group ID, real user ID, real group ID, process ID, audit session ID, and the terminal port and IP address. The return line gives the error status and return value of the system call, and the trailer line gives the size of the token in bytes. These attributes matter for different reasons: the path may indicate the importance or security level of the information being accessed or executed, and a change in file-path access locations or in the type of system calls executed by the same user might indicate anomalous behaviour worth investigating; the file path for exec and execve is the actual program executed; process IDs allow tracking of all system calls made by the same process, which is extremely helpful when catalogued along with time and date; and the terminal allows tracking of where a specific user logs in from, where a very frequent or very rare change could indicate something anomalous.
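A rough parsing sketch follows; the field positions are my reading of the record layout described above, so treat them as assumptions. It turns one audit token into a dict of the attributes later used as features.

def parse_token(lines):
    """Extract call, time, path, argument count, user/process IDs, and return status."""
    record = {}
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        kind = parts[0]
        if kind == "header":
            record["call"] = parts[3]                 # e.g. execve(2)
            record["time"] = ",".join(parts[4:])      # remaining fields: date/time
        elif kind == "path":
            record["path"] = parts[1]
        elif kind == "exec_args":
            record["n_args"] = int(parts[1])
        elif kind == "subject":
            record["audit_id"] = parts[1]             # then euid, egid, ruid, rgid, pid...
            record["euid"] = parts[2]
            record["pid"] = parts[6]
        elif kind == "return":
            record["status"], record["retval"] = parts[1], parts[2]
    return record

# Hypothetical token with placeholder values, mirroring the sample above.
sample = [
    "header,150,2,execve(2),,<date and time>,+ <ms> msec",
    "path,/usr/lib/fs/ufs/quota",
    "exec_args,1",
    "subject,2110,root,rjm,2110,rjm,280,272,<terminal>",
    "return,success,0",
    "trailer,150",
]
print(parse_token(sample))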

31 Token Sub-graph
This is the graphical representation of a token; the dataset yields about 62 thousand tokens.

32 Total False Positives/Negatives
Performance: total ensemble accuracy by number of models.

# of models              True Positives   False Positives   False Negatives
Normal GBAD (K = 1)      9                920               0
K = 3                    9                188               0
K = 5                    9                180               0
K = 7                    9                179               0
K = 9                    9                150               0

This table shows the total number of true positives, false positives and false negatives for ensembles of different sizes. With an ensemble of size 1, i.e. a single model or normal GBAD, the number of false positives is 920, whereas with an ensemble of size 3 the false positives drop significantly to 188; for K = 5 it is 180, for K = 7 it is 179, and for K = 9 it is 150. True positives and false negatives remain consistent regardless of ensemble size, and there are 0 false negatives.

33 Performance Contd.. 0 false negatives
Significant decrease in false positives as the number of models increases; false positives decrease only slowly after K = 3.

34 Performance Contd.. Distribution of False Positives
This figure shows the distribution of false positives over the weeks for different ensemble sizes K. The x-axis is the week and the y-axis is the number of false positives in that week for a particular ensemble size, so the plot shows per-week values rather than a trend over a window of weeks. Initially (week 1) the number of false positives is the same for every ensemble size, but after 3-4 weeks, once the ensemble reaches its target size, the ensemble clearly outperforms the single model, and a large gap remains between the single model and the ensemble models over the 9 weeks. For example, at week 6 EIT observes 6 false positives for K = 3 versus 25 for K = 1.

35 Performance Contd.. Summary of Datasets A & B

Entry           Dataset A    Dataset B
User            Donaldh      William
# of vertices   269          1283
# of edges      556          469
Week            2-8          4-7
Day             Friday       Thursday

For further experiments we use two subsets of the dataset, covering the activities of users Donaldh and William only. We chose their activity on Friday and Thursday respectively because insider threat activity involving those users occurred on those days. For Donaldh the data is taken from weeks 2-8 and for William from weeks 4-7.

36 Performance Contd..
[Figures: the effect of q on TP rates, on FP rates, and on runtime, for fixed K = 6 on dataset A.]
This slide shows the impact of the number of normative substructures q, for a fixed ensemble size K = 6, on true positives, false positives and runtime. In the first figure the x-axis is the number of substructures q and the y-axis is the number of true positives: as q increases, TP increases as well; we observed 1, 4 and 7 TPs for q = 1, 2 and 4 respectively. A higher q (rather than q = 1) helps find the right anomalies and hence increases TP. Once q exceeds 4 the TP count remains constant, because in this smaller dataset more than 4 normative structures may not exist. In the second figure the x-axis is again q and the y-axis is FP: for fixed K = 6, FP increases with q; we observed 7, 20 and 103 FPs for q = 1, 2 and 4 respectively, since a higher q confuses the anomaly detector and creates more false positives, and again FP stays constant once q ≥ 4. In the third figure the y-axis is processing time in seconds: runtime also grows with q, as expected; we observed 19 and 29 seconds for q = 1 and q = 3. So a very low q is not suitable for insider threat detection, but we also cannot choose a very high q just to raise TP, since that adversely affects FP and runtime; as a balance between the two extremes we choose a moderate value of q.

37 Performance Contd..
[Figures: the effect of K on TP rates and on runtime, for fixed q = 4 on dataset A.]
This slide shows the impact of the ensemble size K, for a fixed q = 4, on true positives and runtime. In the first figure the x-axis is the ensemble size and the y-axis is the number of true positives: as K increases, TP increases; we observed 5, 5 and 7 TPs for K = 1, 2 and 4 respectively. In the second figure the y-axis is processing time in seconds: runtime also grows with K; we observed 21 and 30 seconds for K = 1 and K = 6 respectively. A higher ensemble size K helps achieve higher TP at the expense of processing time, so to avoid an expensive processing cost we use a moderate value of K.

38 Performance Contd.. Impact of fading factor λ (weighted voting)

39 Evolving Insider Threat Detection using Supervised Learning

40 Outline: Supervised Learning
Related Work, Proposed Method, Experiments & Results. The outline of this part is as follows: first I will review related work, next I will present our proposed method in the context of supervised learning, then I will describe our experiments and their outcome, and finally I will discuss future work.

41 Related Approaches and comparison with proposed solutions
Technique (proposed by)        Learning        Concept drift   Insider threat   Graph-based
Liu                            Unsupervised    No              Yes              No
Holder (GBAD)                  Unsupervised    No              Yes              Yes
Masud, Fan (stream mining)     Supervised      Yes             N/A              No
Forrest, Hofmeyr               Supervised      No              Yes              No
Our Approach (EIT-U)           Unsupervised    Yes             Yes              Yes
Our Approach (EIT-S)           Supervised      Yes             Yes              No

In this table we again compare related approaches with our proposed solutions. Earlier we presented this table and described insider threat detection using unsupervised learning (EIT-U); now I will present insider threat detection using supervised learning (EIT-S).

42 Why one-class SVM
Insider threat data is the minority class. Traditional two-class support vector machines (SVMs) trained on such an imbalanced dataset are likely to perform poorly on test data, especially on the minority class. One-class SVMs (OCSVM) address this rare-class issue by building a model that considers only normal (non-threat) data; during the testing phase, test data is classified as normal or anomalous based on its geometric deviation from the model. That is why we chose the one-class SVM.

43 Proposed Method
One-class SVM (OCSVM, supervised learning) + ensemble-based stream mining

44 One-class SVM (OCSVM): f(x) = ⟨w, x⟩ + b
OCSVM maps the training data into a high-dimensional feature space (via a kernel) and then iteratively finds the maximal-margin hyperplane that best separates the training data from the origin. This corresponds to the classification rule f(x) = ⟨w, x⟩ + b, where w is the normal vector and b is a bias term. For testing, if f(x) < 0 we label x as an anomaly, otherwise as normal data. Here the first class contains all the training data and the second class is the origin.

45 OCSVM

46 OCSVM: Kernels
Training is equivalent to solving the dual quadratic programming problem:
min_α (1/2) ∑_{i,j} α_i α_j K(x_i, x_j)   subject to   0 ≤ α_i ≤ 1/(νl),  ∑_i α_i = 1
where the α_i are Lagrange multipliers, ν is a parameter controlling the trade-off between the distance of the hyperplane from the origin and the number of points in the training dataset, and l is the number of points in the training dataset. The kernel function projects input vectors into a feature space, allowing nonlinear decision boundaries: given a feature map Φ: X → R^N, the kernel is K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩.
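As a concrete (hedged) illustration, here is how a one-class SVM with an RBF kernel can be fit on normal-only data with scikit-learn; the original work does not necessarily use this library, and the synthetic data and nu value are purely illustrative. nu corresponds to the ν in the dual constraint 0 ≤ α_i ≤ 1/(νl).

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal_train = rng.normal(0.0, 1.0, size=(500, 7))       # non-threat training data only
test = np.vstack([rng.normal(0.0, 1.0, size=(50, 7)),    # normal-looking records
                  rng.normal(6.0, 1.0, size=(5, 7))])     # records far from the model

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal_train)
scores = ocsvm.decision_function(test)   # signed distance f(x); negative => anomaly
labels = ocsvm.predict(test)             # +1 normal, -1 anomaly
print((labels == -1).sum(), "of", len(test), "test records flagged as anomalies")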

47 Proposed Ensemble based Insider Threat Detection (EIT)
Maintain K number of OCSVM (One class SVM) models Majority Voting Updated Ensemble Always maintain K models Drop least accurate model

48 Ensemble based Classification of Data Streams (supervised Learning)
Divide the data stream into equal-sized chunks, train a classifier from each data chunk, and keep the best K OCSVM classifiers as the ensemble (example: K = 3).
[Figure: labeled chunks D1–D5 produce classifiers C1–C5; the ensemble {C1, C2, C5} predicts on the unlabeled chunk D6, then the ensemble is updated. This addresses infinite length and concept drift.]
This slide shows insider threat detection using OCSVM. There are two components: testing and training. For testing we rely on majority voting to obtain the label of the new chunk. Each week of data is one chunk, and from each chunk we build a model C using OCSVM. When a new chunk, say D6, arrives, each model in the ensemble (e.g. C1, C2 and C5) predicts whether each instance is normal or anomalous, and the final label of each data point in D6 is the majority vote. Once the ground truth is available for the new chunk D6, we train a new model from that chunk with one-class SVM. We then update the ensemble so that it still contains K models: all models predict on D6, the model with the least agreement with the majority-vote decision is dropped, and the new model built from D6 is inserted. In this way we always keep K models.

49 EIT-S pseudocode (Testing)
Algorithm 1: Testing
Input: A ← Build-initial-ensemble(); Du ← latest chunk of unlabeled instances
Output: prediction/label of Du
1: Fu ← Extract&Select-Features(Du)   // feature set for Du
2: for each xj ∈ Fu do
3:   Results ← NULL
4:   for each model M in A
5:     Results ← Results ∪ Prediction(xj, M)
   end for
6:   Anomalies ← MajorityVoting(Results)
end for

50 EIT-S pseudocode (Updating)
Algorithm 2: Updating the classifier ensemble
Input: Dn, the most recently labeled data chunk; A, the current ensemble of the best K classifiers
Output: an updated ensemble A
1: for each model M ∈ A do
2:   test M on Dn and compute its expected error
3: end for
4: Mn ← newly trained one-class SVM classifier (OCSVM) from data Dn
5: test Mn on Dn and compute its expected error
6: A ← best K classifiers from Mn ∪ A based on expected error
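A compact sketch of Algorithm 2, assuming the expected error of a model is measured as its disagreement with the chunk's labels (+1 normal, -1 anomaly) and that OCSVM models are trained on the normal portion of the chunk only.

import numpy as np
from sklearn.svm import OneClassSVM

def update_ensemble(ensemble, X_new, y_new, k=3):
    """ensemble: list of fitted OneClassSVM models; returns the best k of old + new."""
    new_model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
    new_model.fit(X_new[y_new == 1])                      # line 4: train on normal data only
    candidates = ensemble + [new_model]
    errors = [np.mean(m.predict(X_new) != y_new) for m in candidates]   # lines 1-5
    keep = np.argsort(errors)[:k]                         # line 6: keep the k lowest errors
    return [candidates[i] for i in keep]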

51 Time, userID, machine IP, command, argument, path, return
Feature set extracted: time, userID, machine IP, command, argument, path, return.
Example encoded record: 1 1:29669 … 8:1 21:1 32:1 36:0
The first number is the classification of the token, either anomalous (-1) or normal (1). The rest of the line is a list of index:value pairs separated by a colon; the index is the dimension used by the SVM and the value is the token's value along that dimension. For example, "1:29669" means the time of day in seconds is 29669. All of these features are important for different reasons. The time of day could indicate that the user is making system calls during normal business hours or, alternatively, is logging in late at night, which could be anomalous. The path could indicate the security level of the system call being made; for instance, a path beginning with /sbin could indicate use of important system files, while a path like /bin/mail could indicate something more benign, like sending mail. The user ID is important to distinguish events, since what is anomalous for one user may not be anomalous for another: a programmer who normally works 9 A.M. to 5 P.M. would not be expected to log in at midnight, but a maintenance technician who services server equipment during off hours, at night, would.
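A small helper sketch (the dimension numbering is hypothetical) that writes one token in this sparse label index:value format.

def to_sparse_line(label, features):
    """label: +1 normal / -1 anomalous; features: dict dimension_index -> value."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

# Hypothetical token: time of day in seconds plus a few one-hot attribute dimensions.
print(to_sparse_line(1, {1: 29669, 8: 1, 21: 1, 32: 1, 36: 0}))
# -> 1 1:29669 8:1 21:1 32:1 36:0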

52 PERFORMANCE…..

53 Performance Contd.. One-Class SVM vs. Two-Class SVM

                       One-Class SVM   Two-Class SVM
False Positives        3706            0
True Negatives         25701           29407
False Negatives        1               5
True Positives         4               0
Accuracy               0.87            0.99
False Positive Rate    0.13            0.0
False Negative Rate    0.20            1.0

In this table we show why we used a one-class SVM rather than a two-class SVM: the minority class (insider threat) is detected much better by the one-class SVM. The two-class SVM is unable to detect any of the positive cases (true positives). Although the two-class SVM achieves a higher accuracy, it does so at the cost of a 100% false negative rate, whereas the one-class SVM achieves a moderately low false negative rate (20%) while maintaining a high accuracy (87.40%).

54 Performance Contd.. Updating vs Non-updating stream approach
                       Updating (EIT)   Non-updating
False Positives        13774            24426
True Negatives         44362            33710
False Negatives        1                1
True Positives         9                9
Accuracy               0.76             0.58
False Positive Rate    0.24             0.42
False Negative Rate    0.1              0.1

In this table we show why we use the ensemble-based (updating) model rather than a single static model for an evolving data stream. The updating stream approach (EIT) achieves much higher accuracy than the non-updating approach (76% vs. 58%) while maintaining an equivalent, minimal false negative rate (10%), and it also has a lower false positive rate. Here FPR = FP/(FP + TN) and FNR = FN/(FN + TP).

55 Supervised (EIT-S) vs. Unsupervised(EIT-U) Learning
Performance Contd.. Supervised (EIT-S) vs. Unsupervised (EIT-U) Learning

                       Supervised Learning   Unsupervised Learning
False Positives        55                    95
True Negatives         122                   82
False Negatives        0                     5
True Positives         12                    7
Accuracy               0.71                  0.56
False Positive Rate    0.31                  0.54
False Negative Rate    0.0                   0.42

Summary of Dataset A
User           Donaldh
# of records   189
Week           2-7 (Friday only)

In this table we show the effectiveness of supervised learning over unsupervised learning for insider threat detection on dataset A, a subset of the original dataset summarized above. We chose the user Donaldh because the anomaly involves only this user, on the Friday of week 6. We used the same ensemble size K = 3 in both cases; for unsupervised learning we used GBAD with q = 4 normative substructures. Supervised learning achieves much higher accuracy (71%) than unsupervised learning (56%), while maintaining a lower false positive rate (31%) and a 0% false negative rate. Unsupervised learning, by contrast, achieves 56% accuracy, a 54% false positive rate and a 42% false negative rate.

56 Performance Contd.. Accuracy of Ensemble vs # of Dimensions
[Figure: accuracy of the ensemble vs. number of dimensions.]
Now we turn our attention to how the number of dimensions used by the SVM affects the accuracy of the stream ensemble. The number of dimensions can be varied by adding or removing attribute dimensions; for instance, the path category for paths beginning with "/usr" can be split into more categories by adding categories for subdirectories such as "/usr/bin", "/usr/css", "/usr/sbin", and so on. The accuracy of the ensemble appears to decrease as dimensions are added, although it is highly sensitive to even small changes in the number of dimensions. We attribute this sensitivity to the fact that some dimensions are more important than others. For instance, in moving from 7 to 15 dimensions the command attribute was added to the feature set; the command is an important attribute, and adding it increased the accuracy of the ensemble. Other dimensions only confuse the ensemble: they do not help it discriminate anomalous from normal data and only blur the distinction. For instance, in moving from 36 to 47 dimensions the user group attribute was added; it only confused the ensemble, since a user has the same user group when executing an anomalous command as when executing a normal one. The dimensions added last were the ones deemed least essential. Our results show that special consideration should be paid to the discriminatory power of each dimension: unnecessary dimensions simply decrease performance.

57 Conclusion & Future Work
Evolving insider threat detection using stream mining, with both unsupervised and supervised learning. Future work: misuse detection on mobile devices; cloud computing for improving processing time. In this talk we showed how evolving insider threats can be detected using stream mining and two learning techniques, supervised and unsupervised learning. In the future we would like to extend this work in two directions: first, applying these techniques to misuse detection on mobile devices; second, since GBAD is a very time-consuming algorithm, implementing it on Hadoop in a cloud computing framework.

58 Publications
Conference papers:
Pallabi Parveen, Jonathan Evans, Bhavani Thuraisingham, Kevin W. Hamlen, Latifur Khan, "Insider Threat Detection Using Stream Mining and Graph Mining," in Proc. of the Third IEEE International Conference on Information Privacy, Security, Risk and Trust (PASSAT 2011), October 2011, MIT, Boston, USA (full paper acceptance rate: 13%).
Pallabi Parveen, Zackary R. Weger, Bhavani Thuraisingham, Kevin Hamlen, Latifur Khan, "Supervised Learning for Insider Threat Detection Using Stream Mining," to appear in the 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2011), Nov. 7-9, 2011, Boca Raton, Florida, USA (acceptance rate: 30%).
Pallabi Parveen, Bhavani M. Thuraisingham, "Face Recognition Using Multiple Classifiers," ICTAI 2006.
Journal:
Jeffrey Partyka, Pallabi Parveen, Latifur Khan, Bhavani M. Thuraisingham, Shashi Shekhar, "Enhanced geographically typed semantic schema matching," Journal of Web Semantics, 9(1), 2011.
Others:
Neda Alipanah, Pallabi Parveen, Sheetal Menezes, Latifur Khan, Steven Seida, Bhavani M. Thuraisingham, "Ontology-driven query expansion methods to facilitate federated queries," SOCA 2010, 1-8.
Neda Alipanah, Piyush Srivastava, Pallabi Parveen, Bhavani M. Thuraisingham, "Ranking Ontologies Using Verified Entities to Facilitate Federated Queries," Web Intelligence 2010.

59 References
W. Eberle and L. Holder, "Anomaly Detection in Data Represented as Graphs," Intelligent Data Analysis, vol. 11, no. 6.
Ling Chen, Shan Zhang, Li Tu, "An Algorithm for Mining Frequent Items on Data Stream Using Fading Factor," COMPSAC (2), 2009.
S. A. Hofmeyr, S. Forrest, and A. Somayaji, "Intrusion Detection Using Sequences of System Calls," Journal of Computer Security, vol. 6, 1998.
M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, "A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data," Int. Conf. on Data Mining (ICDM), Pisa, Italy, December 2008.

60 Thank You

