U.S. Army Research, Development and Engineering Command Jaime C. Acosta, Ph.D. Using the Longest Common Substring on Dynamic Traces of Malware to Automatically Identify Common Behaviors
Research Question Can we reduce redundant analysis by finding common behaviors in malware instances?
Malware Analysis Dynamic Analysis: Run the malware instance (binary) in a controlled environment –Log all events (registry, memory, sockets, etc.) –Analyze logs for malicious behavior –Find similar malware instances based on runtime behavior
Malware Analysis Event Logs Example events for Malware A, each recorded as an event code: Initialize network socket; Establish connection to malicious.com; Load library; Sleep
Malware Instance Similarity Event n-grams (Rieck et al. 2010) –Find common fixed-size n-grams (sequences of events) in event logs –Common 2-grams for Malware A / Malware B: 01, 02; 02, 03 –Common 2-grams for Malware A / Malware C: 01, 02 –Common 2-grams for Malware B / Malware C: 01, 02 –Malware A and Malware B share the most 2-grams, so they are more likely to be of the same type
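The fixed-size n-gram comparison described above can be sketched as follows. This is a minimal illustration, not the implementation from Rieck et al.: the function names `ngrams` and `common_ngrams` are hypothetical, and the three sample traces are made up so that the shared 2-grams match the ones listed on the slide.

```python
from typing import List, Set, Tuple


def ngrams(trace: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of all length-n event sequences in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def common_ngrams(a: List[str], b: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the n-grams that appear in both traces."""
    return ngrams(a, n) & ngrams(b, n)


# Hypothetical event-code traces (assumed for illustration only).
trace_a = ["00", "01", "02", "03"]
trace_b = ["01", "02", "03", "04"]
trace_c = ["04", "01", "02", "05"]

shared_ab = common_ngrams(trace_a, trace_b, 2)
shared_ac = common_ngrams(trace_a, trace_c, 2)
shared_bc = common_ngrams(trace_b, trace_c, 2)
```

With these sample traces, A and B share two 2-grams while each of them shares only one with C, which is the kind of signal used to group A and B together.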
Malware Instance Similarity Limitations for post-analysis –Lose context given by varied-length sequences (example, Malware A: Initialize network socket; Establish connection to malicious.com; Load library; Sleep; …; Install a rootkit)
Malware Instance Similarity Limitations for post-analysis –Lose context given by varied-length sequences –Lose commonalities between different types of malware
Approach Common Substrings Algorithm –Based on the Longest Common Substring –Finds all common event sequences of minimum (not fixed) length n between trace files in a dataset
Approach Malheur Reference Dataset –Dynamic traces of 3,131 malware instances –Generated with CWSandbox –Trace sizes range from 700 B to 3.4 MB –Collected in August 2009
Approach Malheur Reference Dataset –Traces split into 2 sets
 | Small Set (<100KB) | Large Set (>=100KB)
Total # malware instance trace files | 2,071 | 1,060
Total # events | 1,217,985 | 17,400,262
Total size of malware instance trace files | 44 MB | 490 MB
Approach Goal –Reduce redundant analysis, especially in larger malware –First, find common substrings within small malware traces –Next, reduce analysis workload by removing redundancies in larger malware traces
Approach – Common Substrings Algorithm Input: Malware dynamic traces of the small set (size < 100KB), shown as event traces for Malware A to Malware F Output: Common substrings matrix over traces A to F, in which one cell per pair of traces holds all common substrings between that pair (cells marked X are unused)
Approach – Common Substrings Algorithm Iteration … Malware A … Malware B Malware A Malware B
Approach – Common Substrings Algorithm Iteration 2: match found, merge with upper left corner (figure: comparison table of Malware A events against Malware B events)
Approach – Common Substrings Algorithm Final Iteration (figure: completed comparison table of Malware A events against Malware B events) We have 2 common substrings. We only keep those that meet the minimum substring length.
Approach – Common Substrings Algorithm Selecting which Common Substrings to keep –We have 2 common substrings; we only keep those with minimum substring length 2 –The kept substring (01,02) is stored in the Common Substrings Matrix cell for the pair Malware A / Malware B
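The iteration described above (on a match, extend the value from the upper-left cell, then keep only sequences that reach the minimum length) can be sketched with the standard Longest Common Substring dynamic program. This is a minimal sketch under assumptions, not the author's code: the function name `common_substrings` and the sample traces are hypothetical, and the sketch records only runs that cannot be extended further, which the slides do not specify.

```python
from typing import List, Set, Tuple


def common_substrings(a: List[str], b: List[str],
                      min_len: int = 2) -> Set[Tuple[str, ...]]:
    """Return every maximal common run of events with length >= min_len."""
    found: Set[Tuple[str, ...]] = set()
    # prev[j] is the length of the common run ending at a[i-2] and b[j-1].
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Match found: extend the run stored in the upper-left cell.
                curr[j] = prev[j - 1] + 1
                # The run is maximal if a trace ends or the next events differ.
                run_ends = i == len(a) or j == len(b) or a[i] != b[j]
                if run_ends and curr[j] >= min_len:
                    found.add(tuple(a[i - curr[j]:i]))
        prev = curr
    return found


# Hypothetical traces (assumed): they share one run of length 3 and one of length 1.
trace_a = ["01", "02", "03", "05"]
trace_b = ["04", "01", "02", "03", "06", "05"]

kept = common_substrings(trace_a, trace_b, min_len=2)
all_runs = common_substrings(trace_a, trace_b, min_len=1)
```

Here `all_runs` contains two common substrings, and only the longer one survives the minimum length of 2. Keeping two rows of the table instead of the full matrix holds memory use proportional to the length of one trace.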
Approach – Common Substrings Algorithm Unique common substrings are merged (figure: Common Substrings Matrix over traces A to F, each filled cell listing the common substrings for one pair of traces) Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; …
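The merge step above amounts to taking the union of every filled matrix cell. The sketch below assumes the matrix is stored as a dictionary keyed by trace pair; the name `merge_unique`, the storage layout, and the sample cell contents are all hypothetical.

```python
from typing import Dict, Set, Tuple

Substring = Tuple[str, ...]
Pair = Tuple[str, str]


def merge_unique(matrix: Dict[Pair, Set[Substring]]) -> Set[Substring]:
    """Union the per-pair cells into one set of unique common substrings."""
    merged: Set[Substring] = set()
    for cell in matrix.values():
        merged |= cell
    return merged


# Hypothetical matrix cells (assumed values, one entry per pair of traces).
matrix = {
    ("A", "B"): {("01", "02")},
    ("A", "C"): {("01", "02"), ("03", "02")},
    ("B", "C"): {("03", "02"), ("00", "02")},
}

unique = merge_unique(matrix)
stored_before_merge = sum(len(cell) for cell in matrix.values())
```

In this example five stored entries collapse to three unique substrings, since different pairs often share the same sequence.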
Approach – Common Substrings Algorithm Doesn't that take a lot of space? –Many common substrings are shared between pairs –Total size of all unique common substrings was 25MB Doesn't that take a lot of processing time? –Can be parallelized across separate processes and threads –GPU
Approach Find and remove common substrings in large set (size >= 100KB) –Small set (<100KB) common substrings: 03,02,20,40,35; 03,02,02,02,03; 01,02,02; 00,02; 03,02; … –Example: Malware AA 40% shared, Malware BB 30% shared, Malware CC 50% shared; Average = 40% –This process was run several times with minimum substring lengths from 2 to 100
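The find-and-remove step above can be sketched as follows. The slides do not define how "% shared" is measured, so this sketch assumes it is the fraction of events in a large trace that fall inside at least one occurrence of a known common substring. The function names and all sample data are hypothetical, with traces constructed to reproduce the 40%, 30%, and 50% figures from the example.

```python
from typing import Iterable, List, Tuple

Substring = Tuple[str, ...]


def covered_mask(trace: List[str], substrings: Iterable[Substring]) -> List[bool]:
    """Mark every event that lies inside an occurrence of a known substring."""
    mask = [False] * len(trace)
    for sub in substrings:
        n = len(sub)
        for start in range(len(trace) - n + 1):
            if tuple(trace[start:start + n]) == tuple(sub):
                for k in range(start, start + n):
                    mask[k] = True
    return mask


def shared_fraction(trace: List[str], substrings: Iterable[Substring]) -> float:
    """Fraction of events covered by known substrings (assumed metric)."""
    if not trace:
        return 0.0
    return sum(covered_mask(trace, substrings)) / len(trace)


def remove_common(trace: List[str], substrings: Iterable[Substring]) -> List[str]:
    """Return the events left over after removing all covered events."""
    mask = covered_mask(trace, substrings)
    return [event for event, covered in zip(trace, mask) if not covered]


# Hypothetical common substrings and large-set traces (assumed values).
known = [("01", "02"), ("03", "02", "20")]
trace_aa = ["00", "01", "02", "07", "08", "01", "02", "09", "05", "06"]
trace_bb = ["03", "02", "20", "07", "08", "09", "05", "06", "04", "00"]
trace_cc = ["01", "02", "07", "03", "02", "20", "08", "09", "05", "06"]

fractions = [shared_fraction(t, known) for t in (trace_aa, trace_bb, trace_cc)]
average = sum(fractions) / len(fractions)
residual_aa = remove_common(trace_aa, known)
```

The scan is quadratic in the worst case and is meant only to show the idea; a production version over 490 MB of traces would likely need an indexed or automaton-based matcher.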
Results Analyst's dream: Many long common substrings are shared with the larger set
Results (plot regions A, B, C) A - Not too interesting: finding common pairs of instructions is expected and will not reduce redundant analysis by much
Results (plot regions A, B, C) B - Indicates that analyzing the small traces first can reduce the larger set analysis by about half
Results (plot regions A, B, C) C - Some reassurance that the dataset was reasonably diverse
Contributions –The common substring algorithm is capable of identifying similarities in dynamic traces of malware –Redundant event sequences can be identified to reduce analysis –Commonalities are not limited to short event sequences
Future Work –Use behavior templates, for example regular expressions that identify recurring sequences regardless of repeat count (5 vs. 10 sleep events) –Develop a user interface –Optimization (GPU)
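One way a behavior template might look is sketched below. This is an illustration of the idea only, since the slides describe it as future work: the event codes ("03" for load library, "04" for sleep, "02" for connect), the template itself, and the name `matches_template` are all assumptions.

```python
import re
from typing import List

# Hypothetical template: load library, then one or more sleep events, then connect.
# Event codes "03", "04", and "02" are assumed for illustration.
TEMPLATE = re.compile(r"\b03(?:,04)+,02\b")


def matches_template(trace: List[str]) -> bool:
    """Check whether the trace contains the behavior, whatever the sleep count."""
    return TEMPLATE.search(",".join(trace)) is not None


five_sleeps = ["03"] + ["04"] * 5 + ["02"]
ten_sleeps = ["03"] + ["04"] * 10 + ["02"]
no_sleep = ["03", "02"]

results = [matches_template(t) for t in (five_sleeps, ten_sleeps, no_sleep)]
```

An exact substring comparison would treat the 5-sleep and 10-sleep traces as different sequences, while the template matches both as the same behavior.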
Questions
Sample Common Substrings Retrieve file from server and replace system file –Load library –Connect –Download –Check if exists –Remove –Copy –Remove evidence
Dataset Reference