Pairwise Alignment of Metamorphic Computer Viruses. Student: Scott McGhee. Advisor: Dr. Mark Stamp. Committee: Dr. David Taylor, Dr. Teng Moh.



Agenda
Introduction
– Virus Obfuscation Techniques
– Existing Virus Detection Methods
– Experimental Detection Using Hidden Markov Models
– Proposed Approach Using Profile Hidden Markov Models
– Op-code Sequences
– Example Multiple Alignment
– Pairwise Alignment

Agenda (cont’d)
Creating / Scoring Alignments
– Substitution Scoring
– Gap Penalties
– Creating a Pairwise Alignment
– Creating a Multiple Alignment
– Feng-Doolittle Algorithm
– Sequence Preprocessing
Case Studies
Application Demo
Conclusion

Introduction
Viruses are becoming increasingly complicated
Kits that are readily available online make it easier for amateur programmers to create viruses
Some viruses can change themselves from one generation to the next, which makes them difficult to detect
The goal is to explore a new approach to detecting these kinds of viruses

Obfuscation Techniques
Encrypted Viruses
– Static decryptor and an encrypted virus body
– Key changes from one generation to the next
– Weakness: the decryptor never changes
Polymorphic Viruses
– An encrypted virus with varying decryptors
– Weakness: the virus body still never changes
Metamorphic Viruses
– Virus body can change
– Assembly morphing engine
– Virus generators

Existing Virus Detection Methods
Code Emulation
– Simulated virtual environment
– Retrieval of the unencrypted form of the virus
Pattern-Based Scanning
– Detect patterns or signatures
Heuristic Analysis
– Detect capabilities of an application

Experimental Approach: Using a Standard Hidden Markov Model
Introduced in a previous student’s Master’s Writing Project
Use a set of disassembled viruses from a particular virus family to train a hidden Markov model (HMM)
Use the HMM to score an arbitrary assembly
Designate a threshold: an assembly that scores above the threshold is classified as a virus from that family
Promising results have been shown

Proposed Approach: Using a Profile Hidden Markov Model
Instead of a standard HMM, the proposal is to use a profile HMM
A profile HMM uses position-specific information within the sequence
A profile HMM is trained using a multiple alignment
This project concentrates on the problem of creating multiple alignments for op-code sequences
This approach is used in other fields that rely on sequence analysis

Op-code Sequences
An application such as a virus can usually be disassembled into assembly
Represent a virus as a sequence of op-codes
The op-codes are parsed from the assembly
Each op-code is given a representative character
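The encoding step above can be sketched as follows. This is a minimal illustration, not the project's parser: the function names `extract_opcodes` and `encode`, the `;` comment syntax, and the 36-symbol alphabet are all assumptions made for the example.

```python
def extract_opcodes(asm_text):
    """Return the op-code mnemonic of each instruction line.

    Assumes (hypothetically) one instruction per line, ';' comments,
    and labels on their own line ending with ':'.
    """
    opcodes = []
    for line in asm_text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comment
        if not line or line.endswith(":"):    # skip blanks and labels
            continue
        opcodes.append(line.split()[0].lower())
    return opcodes


def encode(opcodes):
    """Give each distinct op-code one character, in order of first appearance."""
    symbols = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"  # illustrative alphabet
    table = {}
    encoded = []
    for op in opcodes:
        if op not in table:
            # Raises IndexError once more than 36 distinct op-codes appear.
            table[op] = symbols[len(table)]
        encoded.append(table[op])
    return "".join(encoded), table
```

To compare several viruses, the same table would have to be shared across all of them so that a given character always stands for the same op-code.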

Example Multiple Alignment
FCDBAAE0
CDBAAEAA
CDABAEAA
CDABAEAA
FCDB1AAEA
ABAEAA
CDABAEAA
DBAAFAA
AFABPAAEA
ABAAEAA

FCDB-AAE0-
-CDB-AAEAA
-CDA-BAEAA
-CDA-BAEAA
FCDB1AAE-A
-A-B-A-EAA
-CDA-BAEAA
--DB-AAFAA
AFABPAAE-A
-A-B-AAEAA

Pairwise Alignment
A special case of a multiple alignment deals with only 2 sequences
A pairwise alignment can be viewed as substitutions and gap insertions
ABAA---ADD
ABCABCD--D
– Substitute A with C (third column)
– Insert a gap of size 3 (first sequence)
– Insert a gap of size 2 (second sequence)

Creating / Scoring Alignments

Substitution Scoring
Each possible substitution can be assigned a score and placed into a substitution matrix
Ideally the scores should be statistically correlated to the probability that the substitution would take place
Without comprehensive statistics on substitutions of op-codes in real viruses, these values can be guessed
A simple example is given here
     A    B    C    D
A   10   -5   -5   -5
B        10   -5   -5
C             10   -5
D                  10

Gap Penalties
When a gap is inserted, the score is penalized
The penalty is usually a function of the length of the gap
Common gap penalties include
– Linear gap penalty: each gap position has the same cost
– Affine gap penalty: opening a gap is more expensive than extending a gap
The overall score of a pairwise alignment is the sum of the substitution scores and gap penalties
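As a worked illustration of how substitution scores and gap penalties combine, the sketch below scores the example pairwise alignment ABAA---ADD / ABCABCD--D. The 10 / -5 substitution values follow the simple example matrix; the gap costs (-3 per position, or -8 to open and -1 to extend) and all function names are assumptions chosen for the example.

```python
def substitution_score(a, b, match=10, mismatch=-5):
    """Score from the simple example matrix: 10 on the diagonal, -5 elsewhere."""
    return match if a == b else mismatch


def linear_gap(length, per_position=-3):
    """Linear penalty: every gap position costs the same (assumed -3)."""
    return per_position * length


def affine_gap(length, open_cost=-8, extend_cost=-1):
    """Affine penalty: opening costs more than extending (assumed -8 / -1)."""
    return open_cost + extend_cost * (length - 1)


def score_alignment(x, y, gap_fn, gap_char="-"):
    """Sum substitution scores and per-run gap penalties of an aligned pair."""
    assert len(x) == len(y), "aligned sequences must have equal length"
    total = 0
    run_x = run_y = 0  # lengths of the gap runs currently open in x and y
    for a, b in zip(x, y):
        if a != gap_char and run_x:   # a gap run in x just ended
            total += gap_fn(run_x)
            run_x = 0
        if b != gap_char and run_y:   # a gap run in y just ended
            total += gap_fn(run_y)
            run_y = 0
        if a == gap_char:
            run_x += 1
        elif b == gap_char:
            run_y += 1
        else:
            total += substitution_score(a, b)
    if run_x:
        total += gap_fn(run_x)
    if run_y:
        total += gap_fn(run_y)
    return total


print(score_alignment("ABAA---ADD", "ABCABCD--D", linear_gap))  # 35 - 9 - 6 = 20
print(score_alignment("ABAA---ADD", "ABCABCD--D", affine_gap))  # 35 - 10 - 9 = 16
```

The five aligned symbol pairs contribute 4 matches and 1 mismatch (35 in total); only the treatment of the two gap runs differs between the two penalty schemes.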

Creating a Pairwise Alignment
Use dynamic programming
optimum(X_{1…m}, Y_{1…n}) = max of
– optimum(X_{1…m-1}, Y_{1…n}) + cost of one more gap in Y (the mth symbol of X is aligned to a gap)
– optimum(X_{1…m}, Y_{1…n-1}) + cost of one more gap in X (the nth symbol of Y is aligned to a gap)
– optimum(X_{1…m-1}, Y_{1…n-1}) + substitution score of the mth symbol of X with the nth symbol of Y
The optimal alignment can be computed in O(m*n) time for sequences of lengths m and n
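The dynamic program can be sketched as a standard global alignment (Needleman-Wunsch style). This is a minimal version under stated assumptions: a linear gap cost of -3 per position and the 10 / -5 match and mismatch scores of the simple example matrix; the function name is illustrative and an affine penalty would need extra tables.

```python
def needleman_wunsch(x, y, match=10, mismatch=-5, gap=-3):
    """Return (score, aligned_x, aligned_y) for an optimal global alignment."""
    m, n = len(x), len(y)
    # score[i][j] = best score for aligning x[:i] with y[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # substitute x[i-1] with y[j-1]
                score[i - 1][j] + gap,            # x[i-1] aligned to a gap in y
                score[i][j - 1] + gap,            # y[j-1] aligned to a gap in x
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return score[m][n], "".join(reversed(ax)), "".join(reversed(ay))
```

Filling the table takes O(m*n) time and space; when several alignments tie for the best score, the traceback above returns just one of them.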

Creating a Multiple Alignment
Use a progressive alignment
Choose 2 sequences and create a pairwise alignment using dynamic programming
Progressively add sequences to this alignment
– Choose a sequence in the alignment and one not in the alignment
– Create a pairwise alignment
– Update the other sequences in the alignment with any new gaps that were inserted, then add the newly aligned sequence to the overall alignment

Feng-Doolittle Algorithm
How do you choose the order in which sequences are added to the multiple sequence alignment (MSA)?
Given a set of n sequences, pre-compute the alignment score for each possible pair of sequences (n choose 2 pairs)
The scores can be represented as the distance matrix of a fully connected graph with n nodes
Compute a minimum spanning tree to minimize the cost (or, equivalently, a maximum spanning tree to maximize the score)

Feng-Doolittle Algorithm (cont’d)
Start with the highest-scoring pairwise alignment and follow the tree
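One way to derive the joining order from the pairwise scores is a Prim-style maximum spanning tree that is seeded with the best-scoring pair. The sketch below is an illustration under assumptions: `scores` is a hypothetical symmetric n x n matrix of pairwise alignment scores (higher means more similar), n is at least 2, and the function name is invented for the example.

```python
def alignment_order(scores):
    """Return spanning-tree edges (i, j) in the order they should be aligned.

    The first edge is the highest-scoring pair. In every later edge, i is a
    sequence already in the alignment and j is the sequence being added.
    """
    n = len(scores)
    # Seed the tree with the globally best pair.
    _, a, b = max(
        (scores[i][j], i, j) for i in range(n) for j in range(i + 1, n)
    )
    in_tree = {a, b}
    edges = [(a, b)]
    # Prim's algorithm: repeatedly take the best edge that leaves the tree.
    while len(in_tree) < n:
        _, i, j = max(
            (scores[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


scores = [
    [0, 50, 20, 10],
    [50, 0, 40, 5],
    [20, 40, 0, 30],
    [10, 5, 30, 0],
]
print(alignment_order(scores))  # [(0, 1), (1, 2), (2, 3)]
```

Each returned edge pairs a sequence already in the alignment with one that is not, which matches the progressive step of aligning an existing member against the newcomer.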

Feng-Doolittle Algorithm (cont’d)
MSA Before New Alignment
5) CDABBAFCDB1AAEAA+CEDA+EQ+CDABABABALF4LBBAFBSBAAAAA
4) 2AABBAFCDABA+EAABCEDCDEQFCDABA+APALF4+BBA++SBAAAAA
8) ++AABA+CDB+AAEAA+CEDCDEQ+CDABPBA+ABF4+BBAFBSBMAAAA
3) A+ABBAFCDABA+EAA+CEDCDEQA++ABFBAN++F4+BBAFBTYBAAAA
New Alignment
2) A-ABNBAFCD-BAAEAABCEDA-EQ-CDABAB--BAF4NBBM-BTYBAAAA
3) A+AB-BAFCDABA+EAA+CEDCDEQA++ABFBAN++F4+BBAFBTYBAAAA
       ^ (gap introduced)
MSA After New Alignment
5) CDAB+BAFCDB1AAEAA+CEDA+EQ+CDABABABALF4LBBAFBSBAAAAA
4) 2AAB+BAFCDABA+EAABCEDCDEQFCDABA+APALF4+BBA++SBAAAAA
8) ++AA+BA+CDB+AAEAA+CEDCDEQ+CDABPBA+ABF4+BBAFBSBMAAAA
3) A+AB+BAFCDABA+EAA+CEDCDEQA++ABFBAN++F4+BBAFBTYBAAAA
2) A+ABNBAFCD+BAAEAABCEDA+EQ+CDABAB++BAF4NBBM+BTYBAAAA
       ^ (gap matched)
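The update shown in this example can be sketched as a small merge routine. It follows the notation of the example ("+" for gaps already in the MSA, "-" for gaps introduced by the new pairwise alignment); the function name and argument layout are assumptions, and the sketch presumes the member row was aligned as-is, with its "+" symbols treated as ordinary characters.

```python
def merge_into_msa(msa, member_index, aligned_member, aligned_new):
    """Add a newly aligned sequence to an MSA and propagate new gaps.

    msa            -- list of equal-length rows that use '+' for gaps
    member_index   -- index of the MSA row that was pairwise aligned
    aligned_member -- that row after pairwise alignment ('-' marks new gaps)
    aligned_new    -- the new sequence after pairwise alignment
    """
    member = msa[member_index]
    if aligned_member.replace("-", "") != member:
        raise ValueError("aligned_member does not match the MSA row")
    # Columns where the pairwise alignment opened a new gap in the member row.
    new_columns = [k for k, c in enumerate(aligned_member) if c == "-"]
    updated = []
    for row in msa:
        chars = list(row)
        # Ascending order keeps each index valid in the growing row.
        for k in new_columns:
            chars.insert(k, "+")
        updated.append("".join(chars))
    # The new sequence joins the MSA with its gaps in MSA notation.
    updated.append(aligned_new.replace("-", "+"))
    return updated


msa = ["AB+CD", "ABXCD"]
print(merge_into_msa(msa, 1, "AB-XCD", "ABQ-CD"))
# ['AB++CD', 'AB+XCD', 'ABQ+CD']
```

Applied to the example above, the single "-" that the new alignment introduces in sequence 3 becomes a "+" column in every existing row, and sequence 2 is appended with its own gaps rewritten as "+".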

Sequence Preprocessing
Some metamorphic viruses permute their subroutines
Permuted sequences do not align well
Undoing the permutation in each sequence produces a better alignment
Using subroutine matching, a permutation can be found that maximizes the scores
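A brute-force sketch of the reordering idea follows. It is not the project's subroutine-matching procedure: it assumes each virus has already been split into subroutine strings, uses `difflib.SequenceMatcher.ratio()` as a stand-in for an alignment score, and tries every permutation, which is only feasible for a handful of subroutines.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a, b):
    """Stand-in similarity in [0, 1]; a real pairwise alignment score could be used instead."""
    return SequenceMatcher(None, a, b).ratio()


def best_subroutine_order(reference_subs, candidate_subs):
    """Reorder candidate_subs to best match reference_subs position by position.

    Tries all permutations (factorial cost), so it is suitable only for
    small numbers of subroutines.
    """
    best = max(
        permutations(candidate_subs),
        key=lambda order: sum(
            similarity(r, c) for r, c in zip(reference_subs, order)
        ),
    )
    return list(best)


reference = ["ABAB", "CDCD", "EFEF"]
candidate = ["EFEG", "ABAB", "CDCC"]
print(best_subroutine_order(reference, candidate))  # ['ABAB', 'CDCC', 'EFEG']
```

Concatenating the reordered subroutines then yields a sequence whose layout agrees with the reference before any alignment is attempted.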

Case Studies

Selected Viruses
Next Generation Virus Creation Kit (NGVCK)
– Advanced assembly morphing engine
– Junk code insertion
– Function reordering
Virus Creation Lab Win 32 (VCL32)
– No function reordering
Phalcon/Skism Mass-Produced Code Generator (PS-MPC)
– No function reordering

NGVCK Results
Raw NGVCK viruses did not align well
Preprocessing was required in order to create usable alignments
The profile HMM was able to detect viruses with a 6.8% false-positive rate and a 1% false-negative rate

VCL32 and PS-MPC
The raw viruses of both families aligned well and did not require preprocessing
VCL32 aligned the best
The profile HMM was able to detect both families with 0% false-positive and false-negative rates

Visual Representation of Multiple Alignments Created
[Figure panels: Raw NGVCK, groups of 20; Preprocessed NGVCK, groups of 20; PS-MPC, groups of 15; VCL32, group of 10]

Application Demo

Conclusion
The profile HMM works well on metamorphic viruses that do not permute subroutines
Future research is needed to fully understand the effects of preprocessing on the profile HMM

Thank you questions to