Categorizing networks using Machine Learning

Categorizing networks using Machine Learning
Ralucca Gera (based on Greg Allen's thesis)
Applied Mathematics Dept., Naval Postgraduate School, Monterey, California
rgera@nps.edu

HDD classification
Which email addresses found on a secondary storage device are useful to a forensic analyst? Identify two user groups:
- Useful: addresses carrying useful information about the social network of the device's user.
- Not useful: addresses that an analyst conducting an investigation could ignore (e.g. some_command@2x.pn or username@microsoft.com).
Observation: ~95% of the email addresses scanned are not useful.
(Figures: a sample useful graph and a sample not-useful graph.)

Data: collection process
The data consist of 400 graphs from 10 NPS volunteers (details on the next pages). The drives ranged in size and contained a variety of today's most popular operating systems, including Windows, OS X, and Linux.
(Figure: example graph for one HDD.)

Data: HDD to weighted networks
(Figures: an HDD and its resulting network under Model 1; the same under Model 2.)

Data: the 400 graphs
The data come from 10 NPS volunteers.
- For each HDD, use both Model 1 (addresses within 128 bytes) and Model 2 (addresses within 256 bytes) → 10·2 graphs.
- For each model, create a graph file for each of the top 20 largest connected components of each device → 10·2·20 = 400 graphs.
Note: connected components naturally capture 'similar' email addresses (observed in Janina Green's thesis).
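The windowed construction behind Model 1 and Model 2 can be sketched in plain Python. This is a minimal sketch, not the thesis code: it assumes the disk scan yields (address, byte-offset) pairs, and the function name `build_email_graph` is hypothetical.

```python
from collections import defaultdict

def build_email_graph(hits, window=128):
    """Weighted co-occurrence graph of email addresses: the weight of
    edge {a, b} is the number of times a and b were found within
    `window` bytes of each other on the drive."""
    edges = defaultdict(int)
    hits = sorted(hits, key=lambda h: h[1])   # sort by byte offset
    for i, (a, off_a) in enumerate(hits):
        for b, off_b in hits[i + 1:]:
            if off_b - off_a > window:
                break                          # offsets sorted: stop early
            if a != b:
                edges[frozenset((a, b))] += 1
    return dict(edges)

# Toy scan output: (address, byte offset on the drive)
hits = [("a@x.com", 0), ("b@y.com", 100), ("c@z.com", 500), ("a@x.com", 560)]
edges = build_email_graph(hits, window=128)   # Model 1 window
```

Running the same hits with `window=256` would give the Model 2 graph; the 400 graphs then come from keeping the 20 largest connected components per model per drive.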

Machine Learning Experiments

Tools: used Orange, a GUI-based data-mining toolkit.

Graph attributes
- Normalized the attributes that were not already in [0,1]; ran the classification on both the normalized and non-normalized data.
- Many attributes are computationally 'cheap' (seconds to compute); some are costly.
- The first approach was simply to use intuition to pick the attributes that seemed likely to work best; as the research continued, more and more candidates showed up. NetworkX provided a useful repository of algorithms, although many had to be altered to fit the data.
(The original slide includes a hyperlink to the data.)

Experiment design
Questions posed:
1. Is it possible to correctly classify a network as useful or not, based on the graph's underlying topological structure?
2. Does the size of the window used to create the email graphs affect our ability to classify them correctly?
3. Which attributes are most effective for classifying the graphs in our dataset?
4. Does our ability to correctly classify the graphs improve when we train against a multi-class labeling scheme, as opposed to a binary scheme whose only labels are 'Useful' and 'Not-Useful'?
Each individual test was repeated 10 times using 5-fold cross-validation, and we present results averaged over the 10 trials.
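The evaluation scheme, 5-fold cross-validation repeated 10 times, can be sketched in plain Python (`repeated_kfold` is a hypothetical helper; Orange and scikit-learn provide equivalents):

```python
import random

def repeated_kfold(n_samples, k=5, repeats=10, seed=0):
    """Yield (train, test) index splits: k-fold cross-validation
    repeated `repeats` times, reshuffling before each repeat."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    for _ in range(repeats):
        rng.shuffle(idx)
        fold = n_samples // k
        for f in range(k):
            test = idx[f * fold:(f + 1) * fold]
            train = idx[:f * fold] + idx[(f + 1) * fold:]
            yield train, test

# 400 graphs -> 10 repeats x 5 folds = 50 train/test splits;
# reported scores are the averages over the 10 repeats.
splits = list(repeated_kfold(400))
```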

Conducted 5 experiments
- Graphs used: 400
- Attributes used: 41, including:
  - 4 basic attributes:
    - Normalized order: number of nodes in the component divided by the number of nodes in the entire image.
    - Normalized size: number of edges in the component divided by the number of edges in the entire image.
    - Average degree
    - Density
  - 10 further attributes, including:
    - Average neighbor degree (r-normalized)
    - Pearson coefficient
    - Transitivity
    - Highest betweenness
    - Maximal matching (divided by number of edges)
    - Maximal matching
    - Number of nodes (percentage of the entire image)
    - Degree distribution (best fit)
    - Degree distribution value
- Algorithms: 1. Classification Tree, 2. SVM, 3. Logistic Regression, 4. Naive Bayes
- 128-byte vs. 256-byte windows: yes
- Types of categories: Useful vs. Not-Useful, and 9 categories
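The four basic attributes have simple closed forms. A minimal sketch, where n and m are the component's node and edge counts and image_n, image_m are the totals for the entire image (the function name is hypothetical):

```python
def basic_attributes(n, m, image_n, image_m):
    """The four computationally cheap attributes for one component."""
    return {
        "norm_order": n / image_n,                  # nodes / nodes in image
        "norm_size": m / image_m,                   # edges / edges in image
        "avg_degree": 2 * m / n,                    # sum of degrees = 2m
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
    }

# A component of 5 nodes / 4 edges in an image of 100 nodes / 200 edges:
attrs = basic_attributes(n=5, m=4, image_n=100, image_m=200)
```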

Results

Measuring accuracy: precision vs. recall
Precision: the percentage of instances predicted as relevant that actually are relevant.
Recall: the percentage of relevant instances that are correctly predicted as relevant.
Precision = TruePositives / (TruePositives + FalsePositives)
Recall = TruePositives / (TruePositives + FalseNegatives)
F1 = 2 / (1/recall + 1/precision) = 2 · (precision · recall) / (precision + recall)
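The three metrics follow directly from the confusion-matrix counts; the numbers in the example call are made up for illustration:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# e.g. 8 graphs correctly flagged useful, 2 false alarms, 2 missed:
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)  # each value ≈ 0.8
```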

Experiment 1: all graphs, all attributes
Baseline test to answer question 1: can we accurately classify groups of email addresses located close to each other as useful, based on their formed graph's underlying topological structure?
- Labeled each graph as Useful or Not (we know the ground truth).
- Used all 400 graphs with all 41 attributes.
- Fed the spreadsheet into Orange and ran 4 supervised machine-learning algorithms:
  - Naïve Bayes
  - Classification Tree
  - Logistic Regression
  - SVM
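Of the four algorithms, Naïve Bayes is simple enough to sketch from scratch. A minimal Gaussian Naïve Bayes on toy attribute vectors; the feature values and labels below are made up for illustration (the experiments used Orange's implementations):

```python
import math

def fit_gaussian_nb(X, y):
    """Gaussian Naive Bayes: per-class priors plus per-feature
    means and variances (a small variance floor avoids log(0))."""
    stats = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        prior = len(rows) / len(X)
        cols = list(zip(*rows))
        means = [sum(c) / len(rows) for c in cols]
        var = [sum((v - m) ** 2 for v in c) / len(rows) + 1e-9
               for c, m in zip(cols, means)]
        stats[label] = (prior, means, var)
    return stats

def predict_nb(stats, x):
    """Pick the class with the highest log-posterior for x."""
    def log_post(prior, means, var):
        lp = math.log(prior)
        for v, m, s2 in zip(x, means, var):
            lp += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return lp
    return max(stats, key=lambda lab: log_post(*stats[lab]))

# Toy rows of (density, average degree); labels invented for illustration
X = [[0.9, 3.0], [0.8, 2.5], [0.1, 1.0], [0.2, 1.2]]
y = ["useful", "useful", "not-useful", "not-useful"]
model = fit_gaussian_nb(X, y)
```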

Experiment 1: results
(Figure: Naïve Bayes predictions.)

Experiment 2: 128- vs. 256-byte windows
Identical to Experiment 1, except the dataset was divided into two subsets: one in which the graphs were constructed using a 128-byte window, and one made up of the graphs created with a 256-byte window.

Experiment 2: results

128-byte window
Method                AUC    CA     F1     Precision  Recall
Naïve Bayes           0.963  0.929  0.522  0.353      1.000
Logistic Regression   0.747  0.974  0.600  0.750      0.500
SVM                   0.913  0.987  0.833  —          —
Classification Tree   0.832  0.986  0.785  0.960      0.667

256-byte window
Method                AUC    CA     F1     Precision  Recall
Naïve Bayes           0.957  0.917  0.480  0.316      1.000
Logistic Regression   0.997  0.994  0.923  0.857      —
SVM                   0.900  0.962  0.625  0.500      0.833
Classification Tree   0.699  0.960  0.436  0.463      0.417

Conclusion: both window sizes performed well, although the Logistic Regression results do vary between them.
(Dashes mark values missing from the transcript.)

Experiment 3: select attributes
Intent: determine which attributes work best.
- Tried different combinations of attributes, starting with 4 basic ones: order, size, average degree, density.
- Still labeled each graph as Useful or Not.
- Ran multiple iterations with different combinations of attributes.

Experiment 3: results

Minimalist – order, size, average degree, and density (computationally inexpensive)
Method                AUC    CA     F1     Precision  Recall
Naïve Bayes           0.968  0.861  0.358  0.218      1.000
Classification Tree   0.867  0.974  0.692  0.643      0.750
Logistic Regression   0.500  0.961  0.000  —          —
SVM                   —      —      —      —          —
Clearly not enough attributes to achieve any meaningful results.

Minimalist plus average neighbor degree
Method                AUC    CA     F1     Precision  Recall
Naïve Bayes           0.968  0.939  0.558  0.387      1.000
Classification Tree   0.872  0.984  0.783  0.818      0.750
Logistic Regression   0.500  0.961  0.000  —          —
SVM                   0.830  0.981  0.727  0.800      0.667
Much better results with the addition of just that one attribute.

Top 10 – density, nodes (% of drive), average neighbor degree (r-normalized), maximal matching/edges, min (r-normalized), betweenness, max_core, Pearson, transitivity, degree distribution
Method                AUC    CA     F1     Precision  Recall
Naïve Bayes           0.966  0.935  0.545  0.375      1.000
Classification Tree   0.767  0.967  0.578  0.627      0.550
Logistic Regression   0.580  0.961  0.250  0.500      0.167
SVM                   0.955  0.990  0.880  0.846      0.917
Best with the top 10 attributes; performance falls as the number of attributes grows past 10.
(Dashes mark values missing from the transcript.)

Attribute figures (one slide each):
- Degree distribution
- Average neighbor degree
- Betweenness centrality
- Density
- Pearson correlation coefficient
- Transitivity
- Modularity

Experiment 4: multiple classes
- We saw similar groups that kept showing up on different devices, enough to define 9 classes: Owner (Useful), Database, Ubuntu, Microsoft, Certificates, Broadcast, Username, Mac Artifact, Other.
- This reduced the previous 95% share of 'Not-Useful' graphs to ~50%.

Experiment 4: results

9 classes
Method                AUC    CA     F1     Precision  Recall
Naïve Bayes           0.818  0.958  0.552  0.471      0.667
Classification Tree   0.820  0.977  0.684  0.725      0.650
Logistic Regression   0.790  0.981  0.700  0.875      0.583
SVM                   0.912  0.984  0.800  0.769      0.833
Similar to the results from Experiment 2.

8 classes – Useful and Ubuntu combined into one class (see next slide for reasoning)
Method                AUC    CA     F1     Precision  Recall
Naïve Bayes           0.986  0.974  0.852  0.742      1.000
Classification Tree   0.925  0.987  0.907  0.971      —
Logistic Regression   0.933  0.909  0.952  0.870      —
SVM                   0.984  0.898  0.846  0.957      —
Better.

2 classes – back to the binary classification scheme, but with Ubuntu and Useful combined
Method                AUC    CA     F1     Precision  Recall
Naïve Bayes           0.980  0.965  0.797  0.663      1.000
Classification Tree   0.941  0.982  0.877  0.861      0.894
Logistic Regression   0.975  0.994  0.954  0.955      —
SVM                   0.970  0.984  0.896  0.843      —
Best.
(Dashes mark values missing from the transcript; their column position is uncertain.)

Extra Slides

Experiment 4: confusion matrix – class definitions
- Owner: email addresses that the owner communicated with (my_best_friend@yahoo.com).
- Database: email addresses of the form SOME_NAME@database.com.
- Ubuntu: email addresses of the form username@ubuntu.com or username@debian.org.
- Microsoft: email addresses of the form username@microsoft.com.
- Certificates: email addresses of the form LONG_RANDOM_STRING@certificate_site.com (e.g. 5E6119578EEEC642A96EC666DC7940C03C3D15BE@UPDCHU15.easf.csd.organiziation.mil).
- Broadcast: email addresses of the form some_group@some_school.edu.
- Username: email addresses of the form owner's username@some_website.com (username@google.com).
- Mac Artifact: email addresses that appear to be Mac commands (some_command@2x.p).
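Since the class definitions are pattern-based, a rough rule-of-thumb labeller can illustrate them. This is purely illustrative: the thesis labels were assigned by inspection, every rule below is a hypothetical approximation, and not all classes are covered.

```python
import re

def label_address(addr, owner_username="jdoe"):
    """Hypothetical pattern rules mirroring the class definitions."""
    local, _, domain = addr.partition("@")
    domain = domain.lower()
    if re.fullmatch(r"[0-9a-f]{30,}", local.lower()):
        return "Certificates"          # long random hex string
    if domain in ("ubuntu.com", "debian.org"):
        return "Ubuntu"
    if domain == "microsoft.com":
        return "Microsoft"
    if domain == "2x.p":
        return "Mac Artifact"          # e.g. some_command@2x.p
    if local == owner_username:
        return "Username"              # owner's username at some website
    return "Owner"                     # fall-through; Database, Broadcast
                                       # and Other are omitted here
```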

Classification tree
(Figure: classification tree for graphs with more than 20 nodes.)

AUC: see https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it

References
- Greg Allen, Master's in CS (Network Science), NPS. Thesis: "Locality Based Email Clustering." (The original slide links to the attribute data he used.)
- Janina Green, Master's in CS (Network Science), NPS. Thesis: "Constructing Social Networks from Secondary Storage with Bulk Analysis Tools."