Malware Detection, Xutong Chen & Xin Zhou

Effective and Efficient Malware Detection at the End Host
Presenter: Xutong Chen

Background and motivation
Malware is one of the most serious security threats on the Internet today.
- It is an underlying cause of most Internet problems, such as spam and DoS attacks.
- After compromising a host, it can perform various malicious functions: screen grabbing, keylogging, webcam logging, or even turning off your computer.

Characteristic malicious behavior
- Manifest on the system
- Conceal itself from detection
- Replicate
Examples shown: creating registry entries, sending, injection into benign applications.
Manifesting is the behavior that makes malware persistent on your computer; that is, it needs to survive a reboot.

Two kinds of approaches
- Network-based approach: manually crafted signatures loaded into intrusion detection systems; bot detectors
- Host-based approach: file hashes and byte signatures; capturing the system calls that a specific malware program executes
Both kinds of approaches have severe shortcomings in effectiveness and efficiency.

Overall problem definition and challenges
Previous research has obvious drawbacks
- Network-based approach: malware has many options for making network-based detection very difficult, and this approach cannot detect local malicious activity.
- Host-based approach: state-of-the-art techniques suffer from ineffective models and can be easily evaded by code polymorphism and system call adjustment.
Network-based approach:
1. Network-based approaches do not capture the actual activity; they rely on the artifacts (the traffic) that malware produces. Malware can encrypt the data it sends or change the properties of its network traffic.
2. If malware does not send or receive any network traffic, the approach does not work at all.
Host-based approach:
1. Code polymorphism targets static analysis.
2. System call adjustment includes reordering system calls and adding irrelevant system calls.

Problem
- Propose a model for general malware detection that is both effective and efficient.
Challenges
- Effectiveness: the model should not be easily evaded by simple code polymorphism or system call reordering techniques.
- Efficiency: the detecting system built on this model should not raise the overhead of the whole operating system to an unaffordable level.
Note on efficiency: the emphasis is on the “detecting system”, because the requirement concerns only the overhead of the detecting system, not the signature training system.

Solution: Overall
Signature building process
- Initial signature graph extraction
- Compute argument functions
- Optimize functions
Signature matching process
- Repeatedly activate nodes using information from the system monitor until matching reaches certain important “marks”

Solution: Overall
- Extract data dependencies between system calls.
- Build a signature graph for each malware sample based on these dependencies.
[Figure: example system call traces. Calls shown: NtConnectPort … NtRequestWaitReplyPort, NtCreateSection, NtQueryInformationProcess, NtCreateThread, NtResumeThread. Trace 1: NtUserGetDC, NtGdiGetDeviceCaps, ⁞ Trace 2: NtGdiCreateCompatibleDC, NtGdiBitBlt]
This framework is based on the assumption that all behavior of a malware sample is captured in its system call traces, and that the data flow within these traces can represent the actual behavior.

Solution: Behavior Graph
New graph model
- “Behavior Graph”: an acyclic graph whose nodes are system calls and whose edges are data dependencies between system calls

Solution: Detailed Taint Analysis
Taint analysis
- Taint every byte output by every system call, then find the sinks: the later system calls that use data carrying a taint tag produced by an earlier system call.
- Extend Anubis with taint scope and a memory log.
Detailed taint analysis is proposed to extract the real semantics from system call traces.

Memory log
- For indirect memory accesses, we need to connect the system call behind the real target memory address.
- The memory log records which instruction was the last one to write to a memory address L.
- If syscall B reads L and syscall A was the last to write it, we connect A -> B.
This is a form of over-tainting, but it is needed in this scenario: we need to be sensitive.
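
The last-writer bookkeeping can be sketched roughly as follows. This is a minimal illustration, not the Anubis extension itself: the class `MemoryLog` and its methods are hypothetical names, and the sketch tracks the system call responsible for the last write directly, whereas the slide describes logging the last writing instruction.

```python
class MemoryLog:
    """Toy last-writer log (hypothetical helper, not the real tool's API)."""

    def __init__(self):
        self.last_writer = {}   # address -> syscall that last wrote it
        self.edges = set()      # discovered dependencies (writer, reader)

    def write(self, address, syscall):
        # Remember only the most recent writer of this address.
        self.last_writer[address] = syscall

    def read(self, address, syscall):
        # A read links the reader to whoever wrote the address last.
        writer = self.last_writer.get(address)
        if writer is not None and writer != syscall:
            self.edges.add((writer, syscall))
        return writer


log = MemoryLog()
log.write(0x1000, "A")      # syscall A's output lands at address L
log.read(0x1000, "B")       # syscall B later reads L
print(log.edges)
```

Keeping only the last writer per address is what makes the log cheap, at the cost of the over-tainting mentioned above.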

Taint scope
- Tainted data is sometimes involved in control flow, such as conditional jumps and loop counters.
- We need to connect the system calls that are related through that control flow.

Solution: Argument function
Argument function
- A taint tracking system is too heavyweight to deploy on end hosts.
- Consider an input argument $a$ of a system call $S$. Its value can be regarded as a function of the outputs of previous system calls: $f_a : (x_1, x_2, \dots, x_n) \to y$, where $x_1, x_2, \dots, x_n$ are the output values of the previously related system calls $P_1, P_2, \dots, P_n$.
- In the simplest situation there is one previous system call $P$ and the current system call $S$, with $f_a(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$
- When we observe that $P$ produces $-1$ and $S$ is called with $0$, we believe there is a data flow between this pair of $P$ and $S$.
Replace taint analysis with pre-computed argument functions in host deployment.
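
The simplest case above can be written down directly. The sketch below assumes the one-input threshold function from the slide; the helper name `has_data_flow` is made up for illustration.

```python
def f_a(x):
    """Argument function from the slide: 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0


def has_data_flow(p_output, s_argument):
    """Claim a dependency P -> S when S's argument equals f_a(P's output)."""
    return f_a(p_output) == s_argument


# P produced -1 and S was called with 0, so the pair is considered linked.
print(has_data_flow(-1, 0))
```

Only the observed output of P and the observed argument of S are needed here; no taint tags are involved.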

Solution: Argument Function
Dependency establishment
- Replace taint tracking with a check against a pre-computed input value.
- Suppose $f_a(1, 1) = 1$, i.e., $p = 1$ and $q = 1$ give $a = 1$. As soon as we see $P_1$ produce $p = 1$ and $P_2$ produce $q = 1$, we can pre-compute $a = 1$. If we later see $S$ called with $a = 1$, we claim that there is a data flow between them.
[Figure: $P_1$ and $P_2$ produce outputs $p$ and $q$, which feed argument $a$ of $S$]
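
A rough sketch of this pre-computation follows. The class `ArgumentPredictor` is a hypothetical name, and the concrete function `p * q` is only an assumed stand-in that happens to satisfy $f_a(1, 1) = 1$; the real argument functions come from program slices.

```python
class ArgumentPredictor:
    """Pre-computes the expected argument of S from outputs of P1..Pn."""

    def __init__(self, func, producers):
        self.func = func                  # argument function f_a
        self.producers = list(producers)  # names of the producing syscalls
        self.outputs = {}                 # observed output per producer
        self.expected = None              # pre-computed value of a

    def observe_output(self, producer, value):
        if producer not in self.producers:
            return
        self.outputs[producer] = value
        # Once every producer has been seen, the argument can be predicted.
        if all(p in self.outputs for p in self.producers):
            self.expected = self.func(*(self.outputs[p] for p in self.producers))

    def matches(self, observed_argument):
        return self.expected is not None and observed_argument == self.expected


predictor = ArgumentPredictor(lambda p, q: p * q, ["P1", "P2"])
predictor.observe_output("P1", 1)
predictor.observe_output("P2", 1)
print(predictor.matches(1))   # S called with a = 1
```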

Argument function benefit
- We no longer need taint information, only the inputs and outputs of system calls.
- This removes a large amount of overhead.

Dynamic program slicing
- We need a way to represent the function $f_a : (x_1, x_2, \dots, x_n) \to y$ for an argument.
- Only a limited number of instructions are involved in calculating the value of $a$.
- Dynamic program slicing produces this set of instructions for each input argument $a$.
Program slicing is a recursive process: start from the instruction that uses the argument, go backward along the argument's def-use chain, add the instructions that define it, and repeat from those instructions. The details are the core of the reference paper.
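
The backward walk along the def-use chain can be illustrated on a toy trace. This sketch makes strong simplifying assumptions: the trace is a straight-line list of `(defined_variable, used_variables)` pairs, only data dependencies are followed, and the recursion is unrolled into a single backward pass. The function name `backward_slice` is hypothetical.

```python
def backward_slice(trace, target):
    """Return (indices of instructions in the slice, remaining free inputs).

    trace  -- list of (defined_var, [used_vars]) in execution order
    target -- the variable whose value we want to explain (argument a)
    """
    needed = {target}
    slice_indices = []
    # Walk the trace backward, following definitions of needed variables.
    for index in range(len(trace) - 1, -1, -1):
        defined, used = trace[index]
        if defined in needed:
            needed.discard(defined)
            needed.update(used)
            slice_indices.append(index)
    return sorted(slice_indices), needed


trace = [
    ("t", ["x1"]),        # t depends on the output of P1
    ("u", ["x2"]),        # u depends on the output of P2
    ("junk", ["x1"]),     # unrelated to a, so it is left out of the slice
    ("a", ["t", "u"]),    # a is the argument passed to S
]
print(backward_slice(trace, "a"))
```

The variables left in the second return value are the inputs of the argument function, here the outputs of the earlier system calls.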

Solution: Function Optimization
Symbolic expression (execution)
- Symbolic execution lets us get rid of the instructions and obtain a more efficient form of the function.
- For very complex functions, we still keep the instruction slice.
Apply symbolic expressions for faster calculation.

Solution: Matching Process
Matching process
- For a single signature graph, first mark all nodes as inactive.
- Every time a new system call S arrives, check each node of the same type as S: are its ancestors activated, and is the data dependency established?
- The data dependency is checked using the argument function.
- If all ancestors are activated and all data dependencies are established, activate the node.
Repeat this procedure for every signature graph.
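
A minimal sketch of this activation loop for one signature graph is shown below. The `Node` and `Matcher` classes are hypothetical, each system call is reduced to a single argument and a single output, and the dependency check is an assumed callable standing in for the argument function. Checking direct parents is enough here because a parent can only be active if its own parents were.

```python
class Node:
    def __init__(self, name, syscall, parents=(), check=None):
        self.name = name
        self.syscall = syscall          # system call type of this node
        self.parents = tuple(parents)   # names of nodes this one depends on
        self.check = check              # check(parent_outputs, arg) -> bool


class Matcher:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.active = {}                # activated node name -> its output

    def on_syscall(self, syscall, arg=None, output=None):
        """Feed one observed system call; return the nodes it activated."""
        activated = []
        for node in self.nodes:
            if node.syscall != syscall or node.name in self.active:
                continue
            if any(parent not in self.active for parent in node.parents):
                continue
            parent_outputs = [self.active[p] for p in node.parents]
            if node.check is None or node.check(parent_outputs, arg):
                self.active[node.name] = output
                activated.append(node.name)
        return activated


graph = [
    Node("open", "NtOpenFile"),
    Node("write", "NtWriteFile", parents=["open"],
         check=lambda outs, arg: outs[0] == arg),  # same handle flows in
]
matcher = Matcher(graph)
matcher.on_syscall("NtOpenFile", output=7)
print(matcher.on_syscall("NtWriteFile", arg=7))
```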

Solution: Interesting Node & “Bottom” node
Heuristics
- We need marks that tell us when to stop; that is, we need to decide when we believe we have a match.
Interesting node & “bottom” node
- Interesting nodes are system calls that write to the file system, the registry, or the network, or that start new processes or system services.
- “Bottom” nodes are system calls that have no outgoing edges.

Solution: Graph Simplification
Graph simplification
- Since we have a match whenever we reach an interesting node, we ignore all nodes in the subgraph rooted at it.
- A node that is not in the subgraph of any interesting node can still turn out to be of no use once those subgraphs are removed.

Solution: Matching Example
A possible matching status while processing

Solution: Lazy Checking for Complex Functions
Lazy checking for complex functions
- Functions that cannot be represented by a symbolic expression (complex functions) are not checked until we reach an interesting node or a “bottom” node.
- If a node N fails the complex function check, we deactivate N along with all nodes in the subgraph rooted at N.
This works because, in practice, the authors find that few dependencies cannot be represented by a symbolic expression.
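
The deferred check and the resulting deactivation might look roughly like this. Both function names are hypothetical, the graph is an assumed adjacency map from a node to its children, and each deferred check is modeled as a zero-argument callable that returns whether the dependency holds.

```python
def deactivate_subgraph(children, active, failed_node):
    """Remove failed_node and every node reachable from it from `active`."""
    stack = [failed_node]
    seen = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        active.discard(node)
        stack.extend(children.get(node, ()))
    return active


def run_deferred_checks(pending, children, active):
    """Run postponed complex-function checks once a mark node is reached."""
    for node, check in pending.items():
        if node in active and not check():
            deactivate_subgraph(children, active, node)
    return active


children = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
active = {"A", "B", "C", "D"}
pending = {"B": lambda: False}   # B's complex dependency does not hold
print(sorted(run_deferred_checks(pending, children, active)))
```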

Solution: Lazy Checking Example
Deactivate corresponding parts.

Evaluation Dataset

Evaluation Training Dataset Effectiveness

Evaluation Testing Dataset Effectiveness
The low effectiveness reflects some misclassification in the ground truth. If we focus only on “known” variants, i.e., variants that also appear in the training set, we get a dataset of 155 samples, on which effectiveness reaches 0.92.

Evaluation False positive
- Low when complex functions are checked
- Running several normal applications on the system did not raise any false positives.
Efficiency
- 7-zip benchmark: CPU-bound execution
- 7-zip compress: mixed workload
- 7-zip archive: I/O-bound execution
- IE: web page rendering speed
- Compile: Visual Studio building 67 files (over … lines of code)
Configurations
- Baseline: run without any part of the detecting system
- Driver: only log system call parameters
- Scanner: enable detection with 44 behavior graphs

Conclusion
Contribution
- A graph model with enough semantics about the behavior of malware
- Very detailed taint analysis
- Good effectiveness, efficiency, and scalability
Limitation
- Models the whole behavior of malware rather than focusing on function semantics
- Semantic gap between the training system and the detecting system
- False positive problem

Inspector Gadget: Automated Extraction of Proprietary Gadgets from Malware Binaries
Presenter: Xin Zhou

Analyzing programs is hard; analyzing malware binaries is harder
- Source code is unlikely to be available
- Obfuscation makes binary code resistant to static analysis
- A malware family has thousands of variants
Conclusion: dynamic analysis is more practical for malware

Drawbacks of dynamic analysis

Malicious behavior might depend on
- Analysis date & time
- Analysis environment (e.g., username, host OS, …)
- Availability of remote resources (e.g., C&C hosts)
Analysis needs to be performed repeatedly on a single sample
- At different points in time
- Preferably on different systems
- Even more time- and resource-consuming

Recall: Remote Access Trojan (RAT)
- A RAT is a core component of an APT attack
- It provides a complex set of potentially harmful functions (PHFs): keylogger, screengrab, remote desktop, remote shell, etc.
Each PHF is implemented as a gadget

Malware dissected into gadgets
- Repeated execution of gadgets may not be that harmful
- Simplifies replay environments
- Automated extraction becomes possible

Implementation Source and sink
- Find sinks first (manually defined behaviors of interest)
- Map the selected behavior to the analyzed process & thread, API accesses, and control flow
- Find and suggest data-manipulating instructions after the chosen API call; possibly refine the chosen position to include the data processing

Implementation Computer aided analysis - Anubis used for taint analysis - Generates instruction log and flow log - Human analyst reads reports and chooses points of interest

Implementation Backward slicing - Recursing on taint labels consumed by API calls - Computes closures: all relevant code and data is included recursively - Extracted code can be run in a self-contained fashion

Implementation Backward slicing

Implementation Forward searching
- Heuristics for detecting endpoints
- String handling instructions may signal the end of a computation. Example: for encoded URLs, string comparison functions might be used once the URLs have been decoded
- Mathematical instructions that indicate cryptographic activity

Implementation Gadget extraction - Gadgets are extracted as standalone DLLs - Other applications invoke DLLs by offering environment hooks

Implementation Host OS access mediation: environment hooks
- Every system / API call is redirected to the gadget player (using a multiplexor function)
- The player can sanitize and/or manipulate call parameters
- If the player decides to allow the API invocation, the call and its parameters are forwarded to the actual implementation (e.g., inside a Windows library)
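
The mediation idea can be sketched in a few lines. Everything named here is an assumption for illustration: the `GadgetPlayer` class, its `multiplex` method, the `resolve` call, and the policy format are invented and do not describe the real player's interface.

```python
class GadgetPlayer:
    """Toy mediator: every call from a gadget goes through multiplex()."""

    def __init__(self, implementations, policy):
        self.implementations = implementations  # call name -> real function
        self.policy = policy    # call name -> sanitizer(args) -> args or None
        self.log = []           # record of every attempted call

    def multiplex(self, api_name, *args):
        self.log.append((api_name, args))
        sanitize = self.policy.get(api_name)
        if sanitize is None:
            raise PermissionError(f"{api_name} is not allowed")
        safe_args = sanitize(args)
        if safe_args is None:
            raise PermissionError(f"{api_name} rejected for {args!r}")
        # Allowed: forward the (possibly rewritten) call to the real code.
        return self.implementations[api_name](*safe_args)


player = GadgetPlayer(
    implementations={"resolve": lambda host: "resolved:" + host},
    policy={"resolve": lambda args: (args[0].lower(),)},
)
print(player.multiplex("resolve", "Example.COM"))
```

Denying by default, as this sketch does, is one possible design choice; the slides only state that the player decides whether to allow each invocation.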

Implementation Gadget player
Confine gadget execution
- Handle crashes (e.g., possible invalid memory accesses)
- One possibility: code emulation
- Here: a separate, monitored thread with signal handling
Mediate accesses to the host OS
- Gadgets are guaranteed to contain no direct calls to system or API functionality
- Each access goes through environment hooks

Implementation Gadget inversion
- Use gadgets as a transformation oracle
- Determine the input from known output data
- Effective against simple encoding / weak encryption, where one output byte corresponds to not too many input bytes
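
To make the oracle idea concrete, here is a brute-force sketch under a strong assumption: output byte i depends only on input bytes 0..i and the output is as long as the input. The XOR gadget and the name `invert_bytewise` are invented for the example; this is not the inversion procedure from the paper.

```python
def invert_bytewise(gadget, output):
    """Recover an input that the gadget maps to `output`, one byte at a time."""
    recovered = bytearray()
    for position, target in enumerate(output):
        for candidate in range(256):
            trial = bytes(recovered) + bytes([candidate])
            # Query the gadget as an oracle and compare a single output byte.
            if gadget(trial)[position] == target:
                recovered.append(candidate)
                break
        else:
            raise ValueError(f"no input byte explains output byte {position}")
    return bytes(recovered)


def xor_gadget(data):
    """Stand-in for an extracted encoding gadget (hypothetical)."""
    return bytes(b ^ 0x5A for b in data)


secret = xor_gadget(b"hello")
print(invert_bytewise(xor_gadget, secret))
```

Each output byte costs at most 256 oracle queries here, which is why the approach is limited to encodings where one output byte depends on few input bytes. When several candidates produce the same output byte, the sketch simply takes the first.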

Evaluation: Conficker
- Generates (pseudo)random domain names upon startup
- The time fetched from a remote site (e.g., msn.com) controls the randomization seed for domain generation
- A randomly selected domain name is used to contact the C&C host

Result from manual analysis

Gadget - Start extraction (slicing) from invocation of DnsQuery_W - Extracts complete Domain Generation Algorithm (DGA) - See one domain on query invocation - Find all domains on gadget heap

Result from automated analysis by Inspector

Evaluation: Cutwail
- Generates spam emails from templates downloaded from remote C&C hosts
- Communication employs a proprietary encryption algorithm
- The template is not stored on the file system; its content is decrypted and handled solely in memory

Evaluation: Cutwail

Evaluation: Cutwail Gadget:
- Inspect download behavior
- Start extraction after the download is complete
- Inspector suggests automatically refining the extraction starting point to the end of decryption
- Extract the complete template download & decryption algorithm

Evaluation: Cutwail

Limitations
- Context information may be incomplete: execution is forced along known routes
- Behaviors triggered conditionally: Inspector is not able to capture such behaviors at all
- Resistance to the dynamic analysis environment (Anubis is based on QEMU)
- Taint analysis evasion
- Works only for single-threaded malware

Conclusion
- Dynamic analysis is resource-consuming; its results are cluttered and limited to temporary snapshots of malicious behavior
- Inspector automatically extracts behavior into standalone gadgets
- Gadgets can be reused in many scenarios to enhance information extraction and simplify repeated analysis of behavior
- The evaluation shows that extraction is applicable to real-world malicious programs
