AccessMiner: Using System-Centric Models for Malware Protection
Andrea Lanzi, Davide Balzarotti, Christopher Kruegel, Mihai Christodorescu and Engin Kirda
ACM CCS 2010 Oct. 1
OUTLINE
Malware Detection
System Call Data Collection
Program-Centric Models and Detection
System-Centric Models and Detection
Discussion and Conclusion
Malware Detection
Signature
  ◦ Static content
  ◦ Byte strings, instruction sequences
  => Evaded by code obfuscation
Behavior
  ◦ Dynamic actions
  ◦ Sequences of system calls, API functions
  ◦ A program-centric approach
  ◦ …good results?
Malware Detection Problem
Test cases
  ◦ Small scale: about 10 benign applications
  ◦ Limited execution: a few minutes, in a sandbox
  ◦ Synthetic inputs
  ◦ Single machine
Malware Detection Problem (cont.)
Program-centric model
  ◦ Narrow view of a program
  ◦ Diversity of system call information
  ◦ How do benign programs interact with their environment?
  ◦ The models may be specific to a small set of benign applications only
System Call Data Collection
A Microsoft Windows kernel module
  ◦ Collects, anonymizes, and uploads system call logs
  ◦ Hooks the System Service Descriptor Table (SSDT)
  ◦ Mindful of system resources
Kernel collector
79 different system calls
  ◦ Related to files, the registry, processes and threads, networking, and memory
  ◦ Same subset as in Anubis
System Call Data
Sensitive data are replaced
  ◦ Non-system paths, user-root registry keys, IP addresses
System Call Data Collection
Large and diverse set of system call traces
  ◦ Ten different machines, different users
  ◦ Several weeks
  ◦ 114.5 GB of data
  ◦ billion system calls
  ◦ 362,600 processes
  ◦ 242 applications
Data set
2~4 days with 2~12 hours
Production systems, development systems
Data Normalization
Raw data (system call logs) => accessed resources and access types
Tracking the access operations
  ◦ The set of resources open at any given time (OS handles)
  ◦ Until the resource is released (NtClose)
Execution path and file name:
  ◦ NtOpenFile, NtCreateSection, NtCreateThread
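The handle-tracking step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the log record layout `(process, syscall, handle, path)` and the `OPEN_CALLS` / `ACCESS_TYPE` tables are assumptions made for the example, although the system call names themselves are real Windows native API calls.

```python
from dataclasses import dataclass

# Hypothetical lookup tables (assumed for this sketch, not taken from the slides).
OPEN_CALLS = {"NtOpenFile", "NtCreateFile"}
ACCESS_TYPE = {"NtReadFile": "read", "NtWriteFile": "write"}


@dataclass(frozen=True)
class Access:
    process: str
    resource: str
    op: str


def normalize(log):
    """Turn raw (process, syscall, handle, path) records into resource accesses."""
    open_handles = {}  # (process, handle) -> resource path currently open
    accesses = []
    for process, syscall, handle, path in log:
        key = (process, handle)
        if syscall in OPEN_CALLS:
            open_handles[key] = path          # resource becomes "open"
        elif syscall == "NtClose":
            open_handles.pop(key, None)       # resource released
        elif syscall in ACCESS_TYPE and key in open_handles:
            accesses.append(Access(process, open_handles[key], ACCESS_TYPE[syscall]))
    return accesses


log = [
    ("a.exe", "NtOpenFile", 4, r"C:\foo\a.txt"),
    ("a.exe", "NtWriteFile", 4, None),
    ("a.exe", "NtClose", 4, None),
    ("a.exe", "NtWriteFile", 4, None),  # ignored: handle 4 is already closed
]
result = normalize(log)
```

Operations on a handle that is not currently open are dropped, which mirrors the idea that an access only counts while the resource is held.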
Analysis of System Call Data
How diverse is the collected system call data?
Focus on system call types
  ◦ Long tradition in the security community
  ◦ Most models rely upon characteristic patterns
Ignore argument values
Creating n-gram Models
Follow a "standard" approach
1. Extract n-gram models for a set of malware programs and a set of benign programs
2. Find all n-grams that appear in malware programs but not in benign programs
3. Hope those n-grams are characteristic of malware programs
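The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: a trace is taken to be a plain list of system call names, and the helper names `ngrams`, `malware_only_ngrams` and `is_flagged` are hypothetical.

```python
def ngrams(trace, n):
    """Return the set of length-n windows over a system call trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def malware_only_ngrams(malware_traces, benign_traces, n):
    """n-grams seen in at least one malware trace and in no benign trace."""
    malware = set().union(*(ngrams(t, n) for t in malware_traces))
    benign = set().union(*(ngrams(t, n) for t in benign_traces))
    return malware - benign


def is_flagged(trace, signature_ngrams, n):
    """A trace is flagged if it contains any malware-only n-gram."""
    return not ngrams(trace, n).isdisjoint(signature_ngrams)


benign = [["NtOpenFile", "NtReadFile", "NtClose"]]
malware = [["NtOpenFile", "NtWriteFile", "NtClose"]]
signatures = malware_only_ngrams(malware, benign, 2)
```

The more diverse the benign traces are, the more n-grams are subtracted in `malware_only_ngrams`, so the signature set shrinks as benign coverage grows.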
Unique n-gram analysis (figure)
n-gram Models
10,838 malware samples from Anubis
Ten experiments (ten machines)
  ◦ Train an n-gram model on system call traces from 9 machines and 2/3 of the malware set
  ◦ Perform detection on the remaining system call traces and the remaining 1/3 of the malware set
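The ten-experiment setup can be sketched as a leave-one-machine-out loop. This is only an illustration: the slides do not say how the 2/3 vs. 1/3 malware split is chosen, so a fixed split by position is assumed here, and the function name and data layout (a dict from machine name to its traces) are hypothetical.

```python
def leave_one_machine_out(machine_traces, malware_samples):
    """Yield one (held_out, train_benign, test_benign, train_malware, test_malware) per machine."""
    # Assumption: the first 2/3 of the samples train the model, the rest test it.
    cut = (2 * len(malware_samples)) // 3
    train_malware = malware_samples[:cut]
    test_malware = malware_samples[cut:]
    for held_out in machine_traces:
        train_benign = [
            trace
            for machine, traces in machine_traces.items()
            if machine != held_out
            for trace in traces
        ]
        test_benign = list(machine_traces[held_out])
        yield held_out, train_benign, test_benign, train_malware, test_malware


machines = {f"m{i}": [f"trace{i}"] for i in range(10)}
samples = list(range(9))
folds = list(leave_one_machine_out(machines, samples))
```

Holding out a whole machine, rather than random traces, tests whether a model built from some users' benign behavior carries over to a machine it has never seen.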
Detection Results (figure)
Program-Centric Models and Detection
System call sequences invoked by benign applications are diverse
  ◦ Models have difficulty distinguishing normal from malicious behavior
A large amount of data is needed
System-Centric Models and Detection
Generalize how benign programs interact with the operating system
Record accesses to files and registry entries
  ◦ Read, write, execute
It is "convergence"
Access Activity Model
A set of labels for operating system resources
A label "L" is a set of access tokens
  ◦ {t0, t1, …, tn}
A token "t" is a pair
  ◦ (a, op)
  a => application
  op => type of access
Initial Access Activity Model (1)
Use system call traces of all benign processes
A virtual file system tree
  ◦ Application "a": C:\foo\a.txt (write)
  ◦ Application "b": C:\foo\bar\b.rar (exec)
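One way to picture the initial model is a tree of path components in which each node carries a label, that is, a set of (application, access type) tokens. The sketch below is an assumption-laden illustration: the `Node` class and the `add_access` / `label_of` helpers are hypothetical names, and paths are split on backslashes without any case folding.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: set = field(default_factory=set)       # set of (application, op) tokens
    children: dict = field(default_factory=dict)  # path component -> Node


def add_access(root, app, path, op):
    """Record that `app` performed `op` on `path`, creating tree nodes as needed."""
    node = root
    for part in path.split("\\"):
        node = node.children.setdefault(part, Node())
    node.label.add((app, op))


def label_of(root, path):
    """Return the label stored at `path`, or an empty set if the path is unknown."""
    node = root
    for part in path.split("\\"):
        node = node.children.get(part)
        if node is None:
            return set()
    return node.label


root = Node()
add_access(root, "a", r"C:\foo\a.txt", "write")
add_access(root, "b", r"C:\foo\bar\b.rar", "exec")
```

With the two accesses from the slide, the tree holds one token at `C:\foo\a.txt` and one at `C:\foo\bar\b.rar`; the intermediate folders exist as nodes but carry empty labels at this stage.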
Model Pre-processing (2)
Remove some elements from the tree
  ◦ Microsoft Windows services
  ◦ Desktop indexing programs
  ◦ Anti-virus software
Identify applications that start processes with different names
  ◦ C:\Windows\system32 => win_core
Model Generalization (3)
Propagated
Container
  ◦ All children are private (without *)
  ◦ C:\Program Files
Merged =>
System-Centric Model Detection
For any op
  ◦ Find the longest prefix P shared between the path to the resource and the folders in the virtual tree stored by our model
Ten experiments
  ◦ File system access activity model: about 100 labels
  ◦ Registry access activity model: about 3000 labels
  ◦ Full access activity model
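The longest-prefix lookup can be sketched as below. Only the lookup itself comes from the slide; the decision rule is an assumption made for illustration: an access is treated as a violation when the label at P contains neither a token for the requesting application nor a wildcard `*` token for that operation, and an access with no matching prefix at all is also treated as a violation. The flat dict model and the function names are hypothetical.

```python
def longest_prefix(model, path):
    """Return the longest folder prefix of `path` that is stored in the model, or None."""
    parts = path.split("\\")
    for end in range(len(parts), 0, -1):
        prefix = "\\".join(parts[:end])
        if prefix in model:
            return prefix
    return None


def is_violation(model, app, path, op):
    """Check one access against the label found at the longest matching prefix."""
    prefix = longest_prefix(model, path)
    if prefix is None:
        return True  # assumption: locations unknown to the model are not allowed
    label = model[prefix]
    # assumption: "*" stands for "any application" in a generalized label
    return (app, op) not in label and ("*", op) not in label


model = {
    r"C:\foo": {("a", "write")},
    r"C:\Windows": {("*", "read")},
}
```

For example, a write by application "a" to `C:\foo\new\x.txt` resolves to the prefix `C:\foo` and is allowed, while the same write by an unknown application is flagged.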
Detection Results (Files)
// Looks sobering
Many malware samples don't work (!)
  ◦ 10,838 -> 7,847
Use only write operations
  ◦ Our own logging component
  ◦ Software updates
Detection Results (Regs)
HKEY_USER\Software\Microsoft
  ◦ Need a larger training set
Discussion and Conclusion
Full access activity model
  ◦ 91% detection / 0% false positives
System-centric approach
Policy violations occurred only for a few specific classes of programs
Network limitation
MAC policy
  ◦ SELinux