Published by Kayla Armstrong. Modified over 11 years ago.
1
Using the Longest Common Substring on Dynamic Traces of Malware to Automatically Identify Common Behaviors
Jaime C. Acosta, Ph.D.
U.S. Army Research, Development and Engineering Command
2
Research Question
Can we reduce redundant analysis by finding common behaviors in malware instances?
3
Malware Analysis
Dynamic Analysis: Run the malware instance (binary) in a controlled environment
– Log all events (registry, memory, sockets, etc.)
– Analyze logs for malicious behavior
– Find similar malware instances based on runtime behavior
4
Malware Analysis
Event Logs
Malware A event codes: 00 01 02 03 …
Events, in order: Initialize network socket; Establish connection to malicious.com; Load library; Sleep
5
Malware Instance Similarity
Event n-grams (Rieck et al. 2010)
– Find common n-grams (or sequences of events) in event logs
Event codes:
Malware A: 00 01 02 03 …
Malware B: 01 02 03 02 …
Malware C: 04 02 05 01 02 …
Common 2-grams:
Malware A / Malware B: 01, 02; 02, 03
Malware A / Malware C: 01, 02
Malware B / Malware C: 01, 02
6
Malware Instance Similarity
Event n-grams (Rieck et al. 2010)
– Find common fixed-size n-grams (or sequences of events) in event logs
Event codes:
Malware A: 00 … 01 02 03 …
Malware B: 01 … 02 03 02 …
Malware C: 04 … 02 05 01 02 …
Common 2-grams:
Malware A / Malware B: 01, 02; 02, 03
Malware A / Malware C: 01, 02
Malware B / Malware C: 01, 02
Malware A and Malware B share the most 2-grams, so they are more likely to be of the same type
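The 2-gram comparison on this slide can be sketched in a few lines of Python. This is only an illustration of the fixed-size n-gram idea, not the Rieck et al. implementation; the helper names `ngrams` and `common_ngrams` are made up for this example, and the traces are the toy event codes from the slide with the ellipses dropped.

```python
def ngrams(trace, n=2):
    """Return the set of length-n event sequences that occur in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


# Toy traces from the slide (event codes kept as strings, ellipses dropped).
traces = {
    "A": ["00", "01", "02", "03"],
    "B": ["01", "02", "03", "02"],
    "C": ["04", "02", "05", "01", "02"],
}


def common_ngrams(x, y, n=2):
    """Hypothetical helper: n-grams that appear in both named traces."""
    return ngrams(traces[x], n) & ngrams(traces[y], n)


shared_ab = common_ngrams("A", "B")
shared_ac = common_ngrams("A", "C")
shared_bc = common_ngrams("B", "C")
```

Because n is fixed, a longer shared run is only ever seen as overlapping pieces, which is the limitation the next slides discuss.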
7
Malware Instance Similarity
Limitations for post analysis
– Lose context given by varied-length sequences
Malware A event codes: 00 01 02 03 04 05 …
Events, in order: Initialize network socket; Establish connection to malicious.com; Load library; Sleep; …; Install a rootkit
8
Malware Instance Similarity
Limitations for post analysis
– Lose context given by varied-length sequences
– Lose commonalities between different types of malware
Malware A: 08 …
Malware B: 06 04 00 01 …
Malware C: 00 …
9
Approach
Common Substrings Algorithm
– Based on the Longest Common Substring
– Finds all common event sequences of minimum (not fixed) length n between trace files in a dataset
10
Approach
Malheur Reference Dataset
– Dynamic traces of 3,131 malware instances
    – Generated with CWSandbox
    – Trace size ranges from 700 B to 3.4 MB
    – Collected in August 2009
11
Approach
Malheur Reference Dataset
– Traces split into 2 sets

                                              Small Set (<100KB)    Large Set (>=100KB)
Total # malware instance trace files          2,071                 1,060
Total # events                                1,217,985             17,400,262
Total size of malware instance trace files    44 MB                 490 MB
12
Approach
Goal
– Reduce redundant analysis, especially in larger malware
    – First, find common substrings within small malware traces
    – Next, reduce analysis workload by removing redundancies in larger malware traces
13
Approach – Common Substrings Algorithm
Input: Malware dynamic traces of the small set (size < 100KB)
Events:
Malware A: 00 … 01 02 03 …
Malware B: 01 … 02 06 02 …
Malware C: 04 … 02 03 00 …
Malware D: 04 … 05 06 02 …
Malware E: 02 … 03 00 04 …
Malware F: 04 … 05 06 00 …
Output: Common substrings matrix (all common substrings between pairs of malware traces; X marks cells that are not computed)

     A    B    C    D    E    F
A    X    X    X    X    X    X
B    …    X    X    X    X    X
C    …    …    X    X    X    X
D    …    …    …    X    X    X
E    …    …    …    …    X    X
F    …    …    …    …    …    X
14
Approach – Common Substrings Algorithm
Iteration 0
Malware A: 00 01 02 03 …
Malware B: 01 02 06 02 …
[Comparison table, Malware A (00 01 02 03) against Malware B (01 02 06 02); no cells filled]
15
Approach – Common Substrings Algorithm
Iteration 1
Malware A: 00 01 02 03 …
Malware B: 01 02 06 02 …
[Comparison table, Malware A (00 01 02 03) against Malware B (01 02 06 02); filled cells: 01]
16
Approach – Common Substrings Algorithm
Iteration 2 – match found, merge with upper left corner
Malware A: 00 01 02 03 …
Malware B: 01 02 06 02 …
[Comparison table, Malware A (00 01 02 03) against Malware B (01 02 06 02); filled cells: 01; 01,02]
17
Approach – Common Substrings Algorithm
Final Iteration
Malware A: 00 01 02 03 …
Malware B: 01 02 06 02 …
[Comparison table, Malware A (00 01 02 03) against Malware B (01 02 06 02); filled cells: 01; 01,02; 02]
We have 2 common substrings. We only keep those with minimum substring length 2
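The table walk-through above can be approximated with the standard dynamic-programming table used for the longest common substring. The sketch below is an assumption about how the steps fit together, not the author's code: the function name `common_substrings` and the parameter `min_len` are invented for this example. Each matching cell extends the run stored in its upper-left neighbour, and a run is recorded once it can no longer be extended.

```python
def common_substrings(a, b, min_len=2):
    """Return the set of maximal common contiguous runs of length >= min_len."""
    rows, cols = len(a), len(b)
    # run[i][j] = length of the common run ending at a[i-1] and b[j-1].
    run = [[0] * (cols + 1) for _ in range(rows + 1)]
    found = set()
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] != b[j - 1]:
                continue
            # Match: extend the run stored in the upper-left neighbour.
            run[i][j] = run[i - 1][j - 1] + 1
            # The run is complete if the next diagonal pair does not match.
            ends_here = i == rows or j == cols or a[i] != b[j]
            if ends_here and run[i][j] >= min_len:
                found.add(tuple(a[i - run[i][j]:i]))
    return found


# Toy traces from the slide (ellipses dropped).
trace_a = ["00", "01", "02", "03"]
trace_b = ["01", "02", "06", "02"]

kept = common_substrings(trace_a, trace_b, min_len=2)
all_runs = common_substrings(trace_a, trace_b, min_len=1)
```

With `min_len=1` the sketch reports the two runs shown in the final table (01,02 and 02); with `min_len=2` only 01,02 is kept.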
18
Approach – Common Substrings Algorithm
Selecting which Common Substrings to keep
[Comparison table, Malware A (00 01 02 03) against Malware B (01 02 06 02); filled cells: 01; 01,02; 02]
We have 2 common substrings. We only keep those with minimum substring length 2
Common Substrings Matrix (X marks cells that are not computed)

     A        B    C    D    E    F
A    X        X    X    X    X    X
B    01,02    X    X    X    X    X
C                  X    X    X    X
D                       X    X    X
E                            X    X
F                                 X
19
Approach – Common Substrings Algorithm
Unique common substrings are merged
[Figure: common substrings matrix for Malware A to F, with cells holding substrings such as 01,02; 02,03,04; 03,02,20,40,35]
Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; …
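One way to picture the matrix and the merge step is a dictionary keyed by trace pairs whose values are then unioned into a single set. The sketch below is illustrative only: `common_substrings` is a hypothetical helper (a plain longest-common-suffix table), and the six traces are the toy inputs from the earlier input/output slide with the ellipses dropped, so the resulting substrings differ from the ones printed on this slide.

```python
from itertools import combinations


def common_substrings(a, b, min_len=2):
    """Hypothetical helper: maximal common runs of length >= min_len."""
    rows, cols = len(a), len(b)
    run = [[0] * (cols + 1) for _ in range(rows + 1)]
    found = set()
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] != b[j - 1]:
                continue
            run[i][j] = run[i - 1][j - 1] + 1
            ends_here = i == rows or j == cols or a[i] != b[j]
            if ends_here and run[i][j] >= min_len:
                found.add(tuple(a[i - run[i][j]:i]))
    return found


# Toy small-set traces (event codes as strings).
traces = {
    "A": ["00", "01", "02", "03"],
    "B": ["01", "02", "06", "02"],
    "C": ["04", "02", "03", "00"],
    "D": ["04", "05", "06", "02"],
    "E": ["02", "03", "00", "04"],
    "F": ["04", "05", "06", "00"],
}

# Lower-triangular matrix stored as {(row, column): set of substrings}.
matrix = {
    (y, x): common_substrings(traces[x], traces[y])
    for x, y in combinations(sorted(traces), 2)
}

# Merge step: substrings shared by several pairs are stored only once.
merged = set().union(*matrix.values())
```

Only the 15 cells below the diagonal are computed, since comparing A with B gives the same substrings as comparing B with A.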
20
Approach – Common Substrings Algorithm
Doesn't that take a lot of space?
– Many common substrings are shared between pairs
– Total size of all unique common substrings was 25 MB
Doesn't that take a lot of processing time?
– Can be run on separate processes with multithreading
– GPU
21
Approach
Find and remove common substrings in large set (size >= 100KB)
Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; …
Large set traces before removal:
Malware AA: 00 02 03 …
Malware BB: 00 01 02 …
Malware CC: 00 01 03 02 …
Large set traces after removal:
Malware AA: 02 03 … (40% shared)
Malware BB: 00 … (30% shared)
Malware CC: 00 01 … (50% shared)
22
Approach
Find and remove common substrings in large set (size >= 100KB)
Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; …
Large set traces before removal:
Malware AA: 00 02 03 …
Malware BB: 00 01 02 …
Malware CC: 00 01 03 02 …
Large set traces after removal:
Malware AA: 02 03 … (40% shared)
Malware BB: 00 … (30% shared)
Malware CC: 00 01 … (50% shared)
Average = 40%
23
Approach
Find and remove common substrings in large set (size >= 100KB)
Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; …
Large set traces before removal:
Malware AA: 00 02 03 …
Malware BB: 00 01 02 …
Malware CC: 00 01 03 02 …
Large set traces after removal:
Malware AA: 02 03 … (40% shared)
Malware BB: 00 … (30% shared)
Malware CC: 00 01 … (50% shared)
This process was run several times with minimum length sizes 2 to 100
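The slides do not say exactly how the shared percentage is computed, so the following is a sketch under one plausible assumption: an event counts as shared if it falls inside any occurrence of a known common substring, and the percentage is the share of such events in the trace. The names `mark_shared` and `remove_shared`, the known substrings, and the two large traces are all made up for illustration.

```python
def mark_shared(trace, known):
    """Return one boolean per event: True if covered by any known substring."""
    covered = [False] * len(trace)
    for sub in known:
        n = len(sub)
        # Naive scan over every start position; fine for a small example.
        for start in range(len(trace) - n + 1):
            if tuple(trace[start:start + n]) == tuple(sub):
                covered[start:start + n] = [True] * n
    return covered


def remove_shared(trace, known):
    """Return (events left after removal, percentage of events shared)."""
    covered = mark_shared(trace, known)
    remaining = [event for event, hit in zip(trace, covered) if not hit]
    shared_pct = 100.0 * sum(covered) / len(trace) if trace else 0.0
    return remaining, shared_pct


# Hypothetical common substrings from the small set.
known = [("01", "02"), ("03", "02")]

# Hypothetical large-set traces.
large = {
    "AA": ["00", "01", "02", "05", "03", "02", "07", "08", "09", "10"],
    "BB": ["01", "02", "06", "06"],
}

results = {name: remove_shared(trace, known) for name, trace in large.items()}
average_pct = sum(pct for _, pct in results.values()) / len(results)
```

Repeating this for each minimum length from 2 to 100 would only require rebuilding `known` with that minimum length before each run.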
24
Results
Analyst's dream: Many long common substrings are shared with the larger set
25
Results
[Figure: results plot with labeled regions A, B, C]
A: Not too interesting. Finding common pairs of instructions is expected and will not reduce redundant analysis by much
26
Results
[Figure: results plot with labeled regions A, B, C]
B: Indicates that small traces can be analyzed first, thus reducing the larger set analysis by about half
27
Results
[Figure: results plot with labeled regions A, B, C]
C: Some reassurance that the dataset was reasonably diverse
28
Contributions
– The common substring algorithm is capable of identifying similarities in dynamic traces of malware
– Redundant event sequences can be identified to reduce analysis
– Commonalities are not limited to short event sequences
29
Future Work
– Use behavior templates
    – For example: regular expressions to identify recurring sequences (5 vs. 10 sleep events)
– Develop a user interface
– Optimization
    – GPU
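As a rough illustration of the behavior-template idea, a regular expression can collapse any run of the same event into one token, so that traces with 5 and 10 sleep events map to the same template. Everything here is an assumption made for the example: the code `03` standing for the sleep event, the `03+` template notation, and the helper name `to_template`.

```python
import re

SLEEP = "03"  # assumed event code for "Sleep"

# One or more consecutive sleep codes in a space-separated trace.
RUN_OF_SLEEPS = re.compile(rf"\b{SLEEP}(?: {SLEEP})*\b")


def to_template(trace):
    """Collapse every run of sleep events into the single token '03+'."""
    text = " ".join(trace)
    return RUN_OF_SLEEPS.sub(f"{SLEEP}+", text)


five_sleeps = ["00", "01"] + [SLEEP] * 5 + ["02"]
ten_sleeps = ["00", "01"] + [SLEEP] * 10 + ["02"]

template_five = to_template(five_sleeps)
template_ten = to_template(ten_sleeps)
```

Both traces reduce to the same template string, so a common-substring comparison on templates would treat them as the same behavior.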
30
Questions
31
Sample Common Substrings
Retrieve file from server and replace system file
– Load library
– Connect
– Download
– Check if exists
– Remove
– Copy
– Remove evidence
32
Dataset Reference http://pi1.informatik.uni-mannheim.de/malheur/