1 A Feature Selection and Evaluation Scheme for Computer Virus Detection Olivier Henchiri and Nathalie Japkowicz School of Information Technology and Engineering.

Presentation transcript:

1 A Feature Selection and Evaluation Scheme for Computer Virus Detection Olivier Henchiri and Nathalie Japkowicz School of Information Technology and Engineering University of Ottawa

2 Motivation
Traditional computer anti-virus systems are signature-based. This technique is appropriate for detecting existing viruses, but it falls short of detecting new, unseen viruses or variants of existing ones.
Yet virus writers strategically modify their viruses so that existing virus signatures do not match the new viruses. They do so in random and unpredictable ways, each time the virus replicates.
Heuristic scanners attempt to compensate for this gap by using more general features from viral code. However, the process requires human intervention and falls short of yielding both good detection rates for new viruses and low false positive rates.
→ Automated searches for general features are needed.

3 Purpose: To improve on current automated search methods for general features
This talk presents:
  A Feature Search and Selection approach for virus detection that performs an exhaustive search on a data set of viruses, yielding a large number of short generic features that are then filtered with respect to how representative they are of viral properties.
  A stringent cross-validation scheme that allows us to simulate real-world conditions of new virus outbreaks.
  Evidence that our Feature Selection approach has high predictive power.

4 Background
Computer viruses are often organized within sets of virus families.
Virus families are characterized by their similarities in:
  Structure
  Code
  Methods of infection
Consideration of virus families is crucial to the task of detection. Indeed, the first virus of a family is usually devastating, while its family variants are typically less so.
→ Our approach uses a priori knowledge of virus families, but our evaluation scheme focuses on evaluating classifiers in their detection of viruses of a family they were not trained on.

5 Feature Search and Selection I
Our feature search and selection algorithm consists of three steps:
  Scanning & Recording: A scanning window of length SequenceLength moves across the binary code, recording the frequency within each family of each sequence it encounters.
  Selection: The features whose family frequency is at or above the threshold IntraFamilySupport are selected. → Only the features most representative of a family are retained.
  Elimination: The features that fall below the threshold InterFamilySupport are eliminated. → Features that are too exclusive to a particular family are rejected.

6 Feature Search and Selection II
Our Feature Search and Selection method is hierarchical and thus scalable to large datasets:
  The Scanning and Recording step is done only once.
  The Selection step is conducted on small family subsets.
  The Elimination step is conducted on shorter feature lists.
Our Feature Search and Selection method ensures that all retained features represent viral properties common to many types of viruses, as opposed to idiosyncrasies specific to one family.
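The three steps described on slides 5 and 6 can be sketched roughly as follows. This is an illustrative reading, not the authors' implementation: the function name `select_features` and the input layout are hypothetical, and two interpretations are assumed. Family frequency is taken to be the fraction of a family's files that contain a sequence, and InterFamilySupport is taken to be the minimum number of families in which a sequence must have been selected.

```python
from collections import Counter, defaultdict


def select_features(families, sequence_length, intra_family_support,
                    inter_family_support):
    """Return the set of byte sequences kept after the three steps.

    families: dict mapping a family name to a list of bytes objects,
              one per virus file (hypothetical input layout).
    intra_family_support: fraction in [0, 1] (assumed interpretation).
    inter_family_support: minimum number of families (assumed interpretation).
    """
    # Step 1, scanning and recording: slide a window over every file and
    # count, per family, how many files contain each sequence.
    family_counts = {}
    for name, files in families.items():
        counts = Counter()
        for code in files:
            windows = {code[i:i + sequence_length]
                       for i in range(len(code) - sequence_length + 1)}
            counts.update(windows)  # a file counts each sequence at most once
        family_counts[name] = counts

    # Step 2, selection: within each family, keep the sequences whose
    # family frequency reaches the intra-family threshold.
    families_supporting = defaultdict(int)
    for name, counts in family_counts.items():
        n_files = len(families[name])
        for seq, n in counts.items():
            if n_files and n / n_files >= intra_family_support:
                families_supporting[seq] += 1

    # Step 3, elimination: drop sequences selected in too few families.
    return {seq for seq, k in families_supporting.items()
            if k >= inter_family_support}
```

Because the per-family counts are computed once and reused, the selection and elimination steps only touch progressively shorter lists, which matches the hierarchical structure the slides describe.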

7 Evaluation Scheme I
Purpose:
  To simulate an environment where a virus detection system is faced with the outbreak of a new, unseen virus.
Procedure:
  Form k folds f_1, ..., f_k such that, for each pair of folds (f_i, f_j) with i, j = 1..k and i ≠ j, the set of families represented in f_i is disjoint from the set of families represented in f_j.
  Benign programs are added, at random, to each fold.
  Perform a regular cross-validation scheme.
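A minimal sketch of the fold construction on slide 7, under stated assumptions: the names `family_disjoint_folds` and `cross_validation_splits` are hypothetical, families are dealt to folds round-robin after shuffling, and benign files are spread evenly after shuffling. The slide only says that benign programs are added at random, so the even spread is a choice made here for illustration.

```python
import random


def family_disjoint_folds(virus_families, benign_files, k, seed=0):
    """Build k folds in which no virus family appears in two folds.

    virus_families: dict mapping a family name to a list of file ids.
    benign_files: iterable of benign file ids.
    Returns a list of k folds; each fold is a list of (file_id, label)
    pairs with label 1 for viruses and 0 for benign programs.
    """
    rng = random.Random(seed)
    names = sorted(virus_families)
    rng.shuffle(names)

    folds = [[] for _ in range(k)]
    # Whole families are assigned to a single fold, so the family sets
    # of any two folds are disjoint by construction.
    for i, name in enumerate(names):
        folds[i % k].extend((f, 1) for f in virus_families[name])

    # Benign programs are distributed at random across the folds.
    benign = list(benign_files)
    rng.shuffle(benign)
    for i, f in enumerate(benign):
        folds[i % k].append((f, 0))
    return folds


def cross_validation_splits(folds):
    """Yield (train, test) pairs as in a regular k-fold cross-validation."""
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

With this construction, every test fold contains only viruses from families the classifier never saw during training, which is the new-outbreak condition the scheme is meant to simulate.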

8 Evaluation Scheme II

9 Results
Traditional Feature Search (best strategy to date): retain 16-byte sequences appearing with a support of at least 1% [Schultz et al., 2001].
Data Set: 1512 viruses and benign executables
  The viruses belong to 110 families.
Parameter Setting:
  SequenceLength = 8
  IntraFamilySupport = 40%
  InterFamilySupport = 3
We obtain up to 93.65% accuracy, versus 65.04% obtained by the traditional feature search approach.

10 Other Observations
Extra Experiments Set-up:
  An additional set of experiments was performed in which the three search parameters were varied.
  The Intra-Family Support was adjusted according to the other two, so that a maximum of 500 features per family are selected in the second step of our algorithm.
Observations:
  Classifiers perform better with shorter sequence lengths. Sequence lengths of 5, 4 and 3 seem optimal.
  Low Inter-Family Support thresholds yield better results, especially for longer sequences.
  Performance generally decreases when the feature set contains fewer than 200 features. Large numbers of small features perform better than small numbers of large ones.
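The slide does not say how the Intra-Family Support was adjusted to respect the cap of 500 features per family. One possible way to do it, shown purely as an assumption, is to pick for each family the lowest threshold that still selects no more than the allowed number of features. The function name `capped_support_threshold` and its input layout are hypothetical.

```python
def capped_support_threshold(supports, max_features=500):
    """Lowest threshold t such that at most max_features sequences have
    support >= t within one family.

    supports: dict mapping a sequence to its support within the family.
    Returns None when even the highest support value is shared by more
    than max_features sequences, so no threshold satisfies the cap.
    """
    values = sorted(supports.values(), reverse=True)
    if len(values) <= max_features:
        # Every sequence fits under the cap; the minimum support suffices.
        return values[-1] if values else 0.0

    threshold = None
    # Try each distinct support value from highest to lowest and stop as
    # soon as lowering the threshold would exceed the cap.
    for v in sorted(set(values), reverse=True):
        if sum(1 for x in values if x >= v) <= max_features:
            threshold = v
        else:
            break
    return threshold
```

Ties are handled conservatively: sequences with equal support are either all selected or all rejected, so the cap is never exceeded.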

11 Conclusion and Future Work
Summary:
  Our Feature Search and Selection method and our Evaluation method focus on selecting generic features that are useful on new, unseen families of viruses.
  Our results demonstrate the usefulness of our method in this setting.
Future Work:
  To reduce the false positive rate further by using a larger number of benign files for training or, more simply, stratification or cost-sensitive learning.
  To test our Feature Search and Selection method in a Retrospective Testing setting, which would involve a set of older viruses in the training set and a set of more recent ones in the test set.