Pairwise Alignment of Metamorphic Computer Viruses Student: Scott McGhee Advisor: Dr. Mark Stamp Committee: Dr. David Taylor, Dr. Teng Moh
Agenda Introduction – Virus Obfuscation Techniques – Existing Virus Detection Methods – Experimental Detection Using Hidden Markov Models – Proposed Approach Using Profile Hidden Markov Models – Op-code Sequences – Example Multiple Alignment – Pairwise Alignment
Agenda (cont’d) Creating / Scoring Alignments – Substitution Scoring – Gap Penalties – Creating a Pairwise Alignment – Creating a Multiple Alignment – Feng-Doolittle Algorithm – Sequence Preprocessing Case Studies Application Demo Conclusion
Introduction Viruses are becoming increasingly complicated It is becoming easier for amateur programmers to create viruses using kits that are readily available online Some viruses can change themselves from one generation to the next, making them difficult to detect The goal is to explore a new approach to detecting these kinds of viruses
Obfuscation Techniques Encrypted Viruses – Static decryptor, and an encrypted virus body – Key changes from one generation to the next – Weakness is the decryptor never changes Polymorphic Viruses – An encrypted virus with varying decryptors – Weakness is the virus body still never changes Metamorphic Viruses – Virus body can change – Assembly morphing engine – Virus Generators
Existing Virus Detection Methods Code Emulation – Simulated virtual environment – Retrieval of unencrypted form of the virus Pattern Based Scanning – Detect patterns or signatures Heuristic Analysis – Detect capabilities of an application
Experimental Approach: Using a Standard Hidden Markov Model Introduced in a previous student’s Master’s Writing Project Use a set of disassembled viruses from a particular virus family to train a hidden Markov model (HMM) Use the HMM to score an arbitrary assembly Designate a threshold such that an assembly scoring above the threshold is classified as a member of the virus family Promising results have been shown
Proposed Approach: Using a Profile Hidden Markov Model Instead of using a standard HMM, the proposal is to use a profile HMM Profile HMMs use position-specific information within the sequence A profile HMM is trained using a multiple alignment This project will concentrate on the problem of creating multiple alignments for op-code sequences Profile HMMs are already used in other fields that rely on sequence analysis, such as bioinformatics
Op-code Sequences An application such as a virus can usually be disassembled into assembly code Represent a virus as a sequence of op-codes The op-codes are parsed from the assembly Each op-code is given a representative character
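The encoding step above can be sketched in a few lines of Python. This is only an illustration: the listing format, the function names `extract_opcodes` and `encode`, and the choice to hand out characters in order of first appearance are all assumptions, not details taken from the project.

```python
def extract_opcodes(assembly_lines):
    """Return the op-code mnemonic of each instruction line.

    Assumes a simple listing format: ';' starts a comment and a line
    ending in ':' is a label. Real disassembler output may differ.
    """
    opcodes = []
    for line in assembly_lines:
        code = line.split(";", 1)[0].strip()  # drop trailing comments
        if not code or code.endswith(":"):    # skip blank lines and labels
            continue
        opcodes.append(code.split()[0].lower())
    return opcodes


def encode(opcodes):
    """Map each distinct op-code to one character, in order of first appearance.

    Supports at most 36 distinct op-codes (illustrative limit).
    """
    symbols = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    table = {}
    encoded = []
    for op in opcodes:
        if op not in table:
            table[op] = symbols[len(table)]
        encoded.append(table[op])
    return "".join(encoded), table


listing = ["start:", "  mov eax, ebx ; copy", "  push eax", "  mov ecx, 1", "  call foo", ""]
sequence, table = encode(extract_opcodes(listing))
print(sequence, table)
```

When several viruses are compared, the same table would have to be shared across all of them so that a given character always stands for the same op-code.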
Example Multiple Alignment
Unaligned sequences:
FCDBAAE0
CDBAAEAA
CDABAEAA
CDABAEAA
FCDB1AAEA
ABAEAA
CDABAEAA
DBAAFAA
AFABPAAEA
ABAAEAA
Aligned sequences:
FCDB-AAE0-
-CDB-AAEAA
-CDA-BAEAA
-CDA-BAEAA
FCDB1AAE-A
-A-B-A-EAA
-CDA-BAEAA
--DB-AAFAA
AFABPAAE-A
-A-B-AAEAA
Pairwise Alignment A special case of a multiple alignment deals with only 2 sequences A pairwise alignment can be viewed as substitutions and gap insertions
ABAA---ADD
ABCABCD--D
Substitute A with C (column 3) Insert gap size 3 (columns 5 to 7) Insert gap size 2 (columns 8 to 9)
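To make the "substitutions and gap insertions" view concrete, the following sketch reads an aligned pair column by column and lists the events it finds. The function name `describe` and the wording of the event strings are illustrative choices.

```python
def describe(x, y, gap="-"):
    """List the substitutions and gap runs in two aligned, equal-length strings."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    events = []
    i = 0
    while i < len(x):
        if x[i] == gap or y[i] == gap:
            # Measure how long the run of gaps in this sequence is.
            gapped = x if x[i] == gap else y
            side = "first" if gapped is x else "second"
            j = i
            while j < len(x) and gapped[j] == gap:
                j += 1
            events.append(f"gap of size {j - i} in {side} sequence")
            i = j
        else:
            if x[i] != y[i]:
                events.append(f"substitute {x[i]} with {y[i]}")
            i += 1
    return events


for event in describe("ABAA---ADD", "ABCABCD--D"):
    print(event)
```

Run on the slide's example, this reports one substitution and two gap runs, matching the annotations above.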
Creating / Scoring Alignments
Substitution Scoring Each possible substitution can be assigned a score and placed into a substitution matrix Ideally the scores should be statistically correlated to the probability that the substitution would take place Without comprehensive statistics on substitutions of op-codes in real viruses, these values have to be estimated A simple example is given here (only the entries that survive from the slide are shown)
     A    B    C    D
A   10   -5
B        10   -5
C             10   -5
D                  10
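A substitution matrix of this kind can be held in a dictionary keyed by symbol pairs. The sketch below assumes every match scores 10 and every mismatch scores -5. That uniform mismatch value is an extrapolation from the few entries visible in the example, and the function names are hypothetical.

```python
def make_substitution_matrix(symbols, match=10, mismatch=-5):
    """Build a symmetric score table keyed by (symbol, symbol).

    The uniform match/mismatch values are illustrative assumptions.
    """
    return {(a, b): (match if a == b else mismatch)
            for a in symbols for b in symbols}


def substitution_score(x, y, matrix):
    """Sum the column scores of two gap-free sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(x, y))


matrix = make_substitution_matrix("ABCD")
print(substitution_score("ABAA", "ABCA", matrix))
```

With real statistics available, the dictionary could instead be filled with per-pair values such as log-odds scores, without changing how it is looked up.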
Gap Penalties When inserting a gap, the score will be penalized The penalty is usually a function of the length of the gap Common gap penalties include – Linear Gap Penalty, each gap position has the same cost, so the penalty grows in proportion to the gap length – Affine Gap Penalty, opening a gap is more expensive than extending a gap The overall score of a pairwise alignment will be the sum total of substitution scores and gap penalties
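The two penalty schemes and the overall scoring rule can be sketched as follows. All numeric costs here are made-up illustrative values, and the affine form used (opening cost plus an extension cost for each further position) is one common convention among several.

```python
def linear_gap(length, per_symbol=3):
    """Linear penalty: every gap position costs the same (illustrative cost)."""
    return -per_symbol * length


def affine_gap(length, open_cost=5, extend_cost=1):
    """Affine penalty: opening is dearer than extending (illustrative costs)."""
    if length == 0:
        return 0
    return -(open_cost + extend_cost * (length - 1))


def score_alignment(x, y, gap_penalty, match=10, mismatch=-5, gap="-"):
    """Sum substitution scores and gap penalties over an aligned pair."""
    total, run, run_side = 0, 0, None
    for a, b in zip(x, y):
        side = "x" if a == gap else "y" if b == gap else None
        if side != run_side and run:
            total += gap_penalty(run)   # the previous gap run has ended
            run = 0
        run_side = side
        if side is None:
            total += match if a == b else mismatch
        else:
            run += 1
    if run:
        total += gap_penalty(run)       # alignment ended inside a gap run
    return total


print(score_alignment("ABAA---ADD", "ABCABCD--D", linear_gap))
print(score_alignment("ABAA---ADD", "ABCABCD--D", affine_gap))
```

Under the affine scheme one long gap is cheaper than several short gaps of the same total length, which tends to keep inserted blocks of code together in the alignment.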
Creating a Pairwise Alignment Use Dynamic Programming
optimum(X 1…m, Y 1…n) = MAX of
– optimum(X 1…m-1, Y 1…n) + gap cost (the mth symbol of X is aligned with a gap in Y)
– optimum(X 1…m, Y 1…n-1) + gap cost (the nth symbol of Y is aligned with a gap in X)
– optimum(X 1…m-1, Y 1…n-1) + substitution score of mth symbol in X with nth symbol of Y
Can compute the optimal alignment in time O(m*n) for sequences of size m and n
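The recurrence above is the classic global alignment dynamic program. A minimal sketch follows, assuming a linear gap penalty and uniform match/mismatch scores for brevity. An affine penalty needs extra bookkeeping per cell and is not shown, and the function name `align` and all score values are illustrative.

```python
def align(x, y, match=10, mismatch=-5, gap=-3):
    """Globally align x and y; return (score, aligned_x, aligned_y)."""
    m, n = len(x), len(y)
    # score[i][j] holds the optimum for the prefixes x[:i] and y[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # substitute
                              score[i - 1][j] + gap,      # gap in y
                              score[i][j - 1] + gap)      # gap in x
    # Trace back from the bottom-right cell to recover one optimal alignment.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(ax)), "".join(reversed(ay))


print(align("ABCD", "ABD"))
```

Filling the table visits each of the (m+1)*(n+1) cells once with constant work per cell, which is where the O(m*n) bound comes from.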
Creating a Multiple Alignment Use a Progressive Alignment Choose 2 sequences to create a pairwise alignment using dynamic programming Progressively add sequences to this alignment – Choose a sequence in the alignment, and one not in the alignment – Create a pairwise alignment – Update the other sequences in the alignment with any new gaps that were inserted, add the new aligned sequence to the overall alignment
Feng-Doolittle Algorithm How do you choose the order in which you add the sequences to the MSA? If given a set of n sequences, pre-compute alignment scores between each possible pair of sequences (n choose 2 pairs) Data can be represented as a distance matrix of a fully connected graph of size n Compute a minimum spanning tree to minimize the cost (or, when the edges carry scores, a maximum spanning tree to maximize the score)
Feng-Doolittle Algorithm (cont’d) Start with the highest-scoring pairwise alignment and follow the tree
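One way to obtain such an ordering is Prim's algorithm run on the pairwise scores, starting from the best-scoring pair and always adding the best remaining edge that connects a new sequence. The sketch below assumes the scores are already computed. The matrix values and the function name `alignment_order` are made up for illustration.

```python
def alignment_order(n, score):
    """Return tree edges (in_tree, new) in the order sequences are added.

    score is a symmetric n x n matrix of pairwise alignment scores, n >= 2.
    Higher is better, so this builds a maximum spanning tree.
    """
    # Begin with the single highest-scoring pair.
    a, b = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda p: score[p[0]][p[1]])
    in_tree = {a, b}
    order = [(a, b)]
    while len(in_tree) < n:
        # Best edge joining a sequence in the tree to one outside it.
        u, v = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda p: score[p[0]][p[1]])
        in_tree.add(v)
        order.append((u, v))
    return order


# Hypothetical pairwise scores for four sequences.
scores = [
    [0, 85, 63, 50],
    [85, 0, 74, 40],
    [63, 74, 0, 90],
    [50, 40, 90, 0],
]
print(alignment_order(4, scores))
```

Each returned edge names the sequence already in the alignment and the sequence to align against it next.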
Feng-Doolittle Algorithm (cont’d)
MSA Before New Alignment
5) CDABBAFCDB1AAEAA+CEDA+EQ+CDABABABALF4LBBAFBSBAAAAA
4) 2AABBAFCDABA+EAABCEDCDEQFCDABA+APALF4+BBA++SBAAAAA
8) ++AABA+CDB+AAEAA+CEDCDEQ+CDABPBA+ABF4+BBAFBSBMAAAA
3) A+ABBAFCDABA+EAA+CEDCDEQA++ABFBAN++F4+BBAFBTYBAAAA
New Alignment
2) A-ABNBAFCD-BAAEAABCEDA-EQ-CDABAB--BAF4NBBM-BTYBAAAA
3) A+AB-BAFCDABA+EAA+CEDCDEQA++ABFBAN++F4+BBAFBTYBAAAA
       ^ (gap introduced)
MSA After New Alignment
5) CDAB+BAFCDB1AAEAA+CEDA+EQ+CDABABABALF4LBBAFBSBAAAAA
4) 2AAB+BAFCDABA+EAABCEDCDEQFCDABA+APALF4+BBA++SBAAAAA
8) ++AA+BA+CDB+AAEAA+CEDCDEQ+CDABPBA+ABF4+BBAFBSBMAAAA
3) A+AB+BAFCDABA+EAA+CEDCDEQA++ABFBAN++F4+BBAFBTYBAAAA
2) A+ABNBAFCD+BAAEAABCEDA+EQ+CDABAB++BAF4NBBM+BTYBAAAA
       ^ (gap matched)
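The update shown in this example can be sketched as a single pass over the re-aligned anchor sequence: wherever the new pairwise alignment introduced a gap into the anchor, the same gap column is inserted into every sequence already in the MSA. The sketch assumes, as the example suggests, that `+` marks gaps already present in the MSA and `-` marks gaps introduced by the new pairwise alignment. The function name `add_to_msa` is hypothetical.

```python
def add_to_msa(msa, anchor_index, anchor_aligned, new_aligned,
               new_gap="-", old_gap="+"):
    """Merge a pairwise alignment (anchor vs. new sequence) into an MSA.

    msa            -- list of equal-length rows, gaps written as old_gap
    anchor_aligned -- msa[anchor_index] with new_gap inserted where needed
    new_aligned    -- the new sequence, aligned against anchor_aligned
    """
    if anchor_aligned.replace(new_gap, "") != msa[anchor_index]:
        raise ValueError("anchor_aligned does not match the MSA row")
    merged = [[] for _ in msa]
    col = 0  # current column in the existing MSA
    for ch in anchor_aligned:
        if ch == new_gap:
            # New gap column: propagate it to every existing row.
            for row in merged:
                row.append(old_gap)
        else:
            for row, seq in zip(merged, msa):
                row.append(seq[col])
            col += 1
    result = ["".join(row) for row in merged]
    result.append(new_aligned.replace(new_gap, old_gap))
    return result


msa = ["ABCD", "A+CD"]
print(add_to_msa(msa, 1, "A+-CD", "A-XCD"))
```

After the merge every row has the same length again, so the result can serve as the MSA for the next step along the tree.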
Sequence Preprocessing Some metamorphic viruses will permute subroutines Permuted sequences will not align well Undoing the permutations in each of the sequences produces a better alignment Using subroutine matching, a permutation can be found which will maximize the scores
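As a rough illustration of subroutine matching, the sketch below tries every ordering of one virus's subroutines against a reference and keeps the ordering with the highest total similarity. It is not the project's method: `difflib.SequenceMatcher` stands in for a real alignment score, both inputs are assumed to have the same number of subroutines, and the exhaustive search is only practical for a handful of subroutines.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a, b):
    """Stand-in for a pairwise alignment score (an assumption of this sketch)."""
    return SequenceMatcher(None, a, b).ratio()


def best_permutation(reference, candidate):
    """Return the order of candidate subroutines that best matches reference."""
    if len(reference) != len(candidate):
        raise ValueError("this sketch needs equal numbers of subroutines")

    def total(order):
        return sum(similarity(r, candidate[k]) for r, k in zip(reference, order))

    # Exhaustive search: factorial cost, fine only for small inputs.
    return max(permutations(range(len(candidate))), key=total)


reference = ["AAAA", "BBBB", "CCCC"]   # op-code strings, one per subroutine
candidate = ["CCCC", "AAAB", "BBBB"]   # same subroutines, permuted and mutated
order = best_permutation(reference, candidate)
reordered = [candidate[k] for k in order]
print(order, reordered)
```

Concatenating the reordered subroutines gives a sequence that lines up with the reference before the pairwise alignment is attempted.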
Case Studies
Selected Viruses Next Generation Virus Creation Kit (NGVCK) – Advanced assembly morphing engine – Junk code insertion – Function reordering Virus Creation Lab Win 32 (VCL32) – No function reordering Phalcon/Skism Mass-Produced Code Generator (PS-MPC) – No function reordering
NGVCK Results Raw NGVCK viruses did not align well Preprocessing was required in order to create usable alignments Profile HMM was able to detect viruses with a 6.8% false-positive rate and 1% false-negative rate
VCL32 and PS-MPC The raw viruses from both generators aligned well and did not require preprocessing VCL32 aligned the best The profile HMM was able to detect both virus families with 0% false-positive and false-negative rates
Visual Representation of Multiple Alignments Created [Figure: four panels showing Raw NGVCK, groups of 20; Preprocessed NGVCK, groups of 20; PS-MPC, groups of 15; VCL32, group of 10]
Application Demo
Conclusion The profile HMM works well on metamorphic viruses which do not permute subroutines Future research is needed in order to fully understand the effects of preprocessing on the profile HMM
Thank you questions to