Machine-Learning Assisted Binary Code Analysis

Presentation transcript:

Machine-Learning Assisted Binary Code Analysis
N. Rosenblum, X. Zhu, B. Miller
Computer Sciences Department, University of Wisconsin - Madison
{nater,jerryzhu,bart}@cs.wisc.edu
K. Hunt
National Security Agency
huntkc@gmail.com

Supporting Static Binary Analysis
Binary analysis is a foundational technique for many areas.
Example uses:
- Malware detection
- Vulnerability analysis
- Static and dynamic instrumentation
- Formal verification
Why analyze binaries?
- Source code unavailable (e.g., malware)
- Source code is inaccurate: compiler transforms structure
- Binary provides most accurate representation
Code is found through symbol information and parsing. This is MUCH HARDER without symbols.
Rosenblum, Zhu, Miller, Hunt ML Assisted Binary Code Analysis

Many Binaries are Stripped
Stripped binaries lack symbol & debug information.
Binary layout (diagram): Headers, Code Segment (functions?), Data Segment
Examples:
- Malicious programs
- Operating system distributions
- Commercial software packages
- Legacy codes
Standard approach: parse from entry point.

Stripped Binaries Exhibit Gaps
After static parsing, gap regions remain in the code segment. Causes:
- Indirect (pointer-based) control ambiguity
- Deliberate call/branch obfuscation
Gaps in the code segment may not contain code.

Stripped Binaries Exhibit Gaps
Gap contents may vary: string data
.__gmon_start__.libc.so.6.stpcpy.strcpy.__divdi3.printf.stdout.strerror.memmove.getopt_long.re_syntax_options.__ctype_b.getenv.__strtol_internal.getpagesize.re_search_2.memcpy.puts.feof.malloc.optarg.btowc._obstack_newchunk.re_match.__ctype_toupper.__xstat64.abort.strrchr._obstack_begin.calloc.re_set_registers.fprintf.
- Dialog
- Constants
- Import names
- Other strings

Stripped Binaries Exhibit Gaps
Gap contents may vary: tables or lists of addresses
0x8022346 0x802434b 0x80243ad 0x80403d0 0x80503d0 0x8052140 0x8053142 0x806000b 0x802321a 0x8023332 0x804132a 0x8050ca0
- Jump tables
- Virtual function tables
- Data objects

Stripped Binaries Exhibit Gaps
Gap contents may vary: code unreachable through standard static parsing
gap_funcA { . . . }  gap_funcB { . . . }  gap_funcC { . . . }
- Function pointers
- Virtual methods
- Obfuscated calls

Stripped Binaries Exhibit Gaps
Gap contents may vary
7a 01 00 fd a2 b3 74 68 69 73 20 65 78 61 6d 70 6c 65 20 69 73 20 62 6f 67 75 73 2e 2e 2e
But… all of these just look like bytes.
- Every byte in gaps may be the start of a function
How can we find code in gaps?
Our approach: use information in known code to model code in gaps.
Previous work (Vigna et al., 2007) augments parsing with simple instruction frequency information.

Modeling Binary Code
Problem reduces to finding function entry points.
Task: classify every byte in a gap as entry point or non-entry point.
Two types of features:
- Content: idiom features of function entry points
  - Based on instruction sequences
- Structure: control flow & conflict features
  - Capture relationship of candidate function entry points
  - Requires joint assignment over all function entry point candidates

Content-based Features
Entry idioms are common patterns at function entry points.
Idioms are preceding and succeeding instruction sequences with wildcards.
(Diagram: candidate C1; one feature is defined for each idiom u.)
Entry idioms:
  push ebp
  push ebp|mov esp,ebp
  push ebp|*|sub esp
  push ebp|*|mov esp,ebp
  *|mov_esp,ebp
  *|sub 0x8,esp
  *|mov 0x8(ebp),eax
  PRE nop
  PRE ret|nop
  PRE pop ebp|*|nop
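To make the idiom notation concrete, here is a minimal Python sketch of matching idioms against a candidate entry point. It is an illustration, not the Dyninst implementation: it assumes the gap has already been decoded into a list of normalized instruction strings, that a candidate is an index into that list rather than a byte offset, that `*` matches exactly one instruction, and that non-wildcard tokens must match exactly. The function names and the idiom `push ebp|*|sub 0x8,esp` are made up for the example.

```python
from typing import List, Sequence, Tuple

WILDCARD = "*"


def parse_idiom(text: str) -> Tuple[bool, List[str]]:
    """Split an idiom such as 'PRE ret|nop' into (is_prefix, tokens)."""
    is_prefix = text.startswith("PRE ")
    body = text[4:] if is_prefix else text
    return is_prefix, body.split("|")


def tokens_match(tokens: Sequence[str], window: Sequence[str]) -> bool:
    """True if every token equals the instruction at its position or is a wildcard."""
    if len(window) < len(tokens):
        return False
    return all(t == WILDCARD or t == ins for t, ins in zip(tokens, window))


def idiom_matches(idiom: str, instructions: Sequence[str], candidate: int) -> bool:
    """Check one idiom at one candidate position (an instruction index)."""
    is_prefix, tokens = parse_idiom(idiom)
    if is_prefix:
        # PRE idioms describe the instructions immediately before the candidate.
        start = candidate - len(tokens)
        if start < 0:
            return False
        window = instructions[start:candidate]
    else:
        # Other idioms describe the instructions starting at the candidate.
        window = instructions[candidate:candidate + len(tokens)]
    return tokens_match(tokens, window)


def idiom_features(idioms: Sequence[str], instructions: Sequence[str],
                   candidate: int) -> List[int]:
    """Binary feature vector: one 0/1 entry per idiom."""
    return [1 if idiom_matches(i, instructions, candidate) else 0 for i in idioms]


# Hypothetical decoded gap: the previous function ends, padding, then a new function.
instructions = ["ret", "nop", "push ebp", "mov esp,ebp", "sub 0x8,esp"]
idioms = ["push ebp", "push ebp|mov esp,ebp", "push ebp|*|sub 0x8,esp",
          "*|mov esp,ebp", "PRE nop", "PRE ret|nop", "PRE pop ebp|*|nop"]
print(idiom_features(idioms, instructions, candidate=2))
```

In this toy gap every idiom except the last fires at index 2, which is the kind of evidence a classifier would weigh in favor of a function entry point there.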

Call Consistency & Overlap
Call & conflict features relate candidate function entry points (FEPs) over the entire gap.
Candidates (diagram): C1 (y1 = 1), C2 (y2 = 1), C3 (y3 = -1), C4 (y4 = 1)
Speaker note: don't be formal; use example subscripts.
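The two structural checks can be sketched in Python as follows. The slide gives no formal definitions, so the ones used here are assumptions: two candidates "overlap" when instructions decoded from each share bytes but start at different offsets, and a call is "inconsistent" when a candidate labeled as a function calls an address whose candidate is labeled as a non-function. The `Candidate` type and all offsets are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Candidate:
    entry: int                                    # byte offset of the candidate FEP
    insns: List[Tuple[int, int]]                  # (offset, length) of decoded instructions
    calls: Set[int] = field(default_factory=set)  # call target offsets


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True if an instruction of a and one of b share bytes but are misaligned."""
    for sa, la in a.insns:
        for sb, lb in b.insns:
            if sa != sb and sa < sb + lb and sb < sa + la:
                return True
    return False


def count_violations(cands: List[Candidate], labels: Dict[int, int]) -> Tuple[int, int]:
    """Return (overlap conflicts, inconsistent calls) for a labeling in {1, -1}."""
    chosen = [c for c in cands if labels[c.entry] == 1]
    overlap = sum(1 for i, a in enumerate(chosen)
                  for b in chosen[i + 1:] if overlaps(a, b))
    bad_calls = sum(1 for c in chosen for t in c.calls if labels.get(t) == -1)
    return overlap, bad_calls


# Hypothetical gap: c1 calls offset 16; c2 starts in the middle of c1's second instruction.
c1 = Candidate(entry=0, insns=[(0, 1), (1, 2), (3, 5)], calls={16})
c2 = Candidate(entry=2, insns=[(2, 3), (5, 2)])
c4 = Candidate(entry=16, insns=[(16, 1), (17, 2)])

print(count_violations([c1, c2, c4], {0: 1, 2: 1, 16: -1}))  # both kinds of violation
print(count_violations([c1, c2, c4], {0: 1, 2: -1, 16: 1}))  # a consistent labeling
```

Neither check can be evaluated for one candidate in isolation, which is why these features call for a joint assignment over all candidates in the gap.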

Markov Random Field Formalization
Joint assignment of yi ∈ {1, -1} for each FEP xi in binary P
- Unary idiom features fu
  - Weights λu trained through logistic regression
- Binary features fo (overlap), fc (call consistency)
  - Weights λo, λc large, negative
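A small Python sketch of how such a joint assignment could be scored follows. It is a toy under stated assumptions, not the model from the talk: the unary term for each candidate is a single precomputed number (standing in for the weighted sum of its idiom features) that counts only when the candidate is labeled 1, the pairwise weights `lam_o` and `lam_c` are arbitrary large negative values, and inference is brute-force enumeration, which is only feasible for a handful of candidates. All numbers are invented.

```python
from itertools import product
from typing import Dict, List, Tuple


def log_score(labels: Dict[str, int], unary: Dict[str, float],
              overlap_pairs: List[Tuple[str, str]],
              call_pairs: List[Tuple[str, str]],
              lam_o: float = -10.0, lam_c: float = -10.0) -> float:
    """Unnormalized log-score of one joint labeling (higher is better)."""
    # Unary part: idiom evidence for every candidate labeled as an entry point.
    score = sum(unary[c] for c, y in labels.items() if y == 1)
    # Overlap penalty: both members of a conflicting pair labeled as functions.
    score += lam_o * sum(1 for a, b in overlap_pairs
                         if labels[a] == 1 and labels[b] == 1)
    # Call-consistency penalty: a function calling a target labeled non-function.
    score += lam_c * sum(1 for caller, callee in call_pairs
                         if labels[caller] == 1 and labels[callee] == -1)
    return score


def best_assignment(unary, overlap_pairs, call_pairs):
    """Exhaustively search all labelings; exponential, for illustration only."""
    names = sorted(unary)
    best = None
    for ys in product((1, -1), repeat=len(names)):
        labels = dict(zip(names, ys))
        s = log_score(labels, unary, overlap_pairs, call_pairs)
        if best is None or s > best[0]:
            best = (s, labels)
    return best


# Hypothetical gap with four candidates: C2 and C3 overlap, and C1 calls C4.
unary = {"C1": 2.0, "C2": 1.5, "C3": 0.5, "C4": -0.5}
score, labels = best_assignment(unary, [("C2", "C3")], [("C1", "C4")])
print(score, labels)
```

With these made-up numbers the best labeling keeps C1, C2 and C4 and rejects C3: C4 is accepted despite weak idiom evidence because C1 calls it, and C3 loses the overlap conflict to the stronger C2.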

Experimental Setup
- Large set (100s) of binaries from department Linux servers and Windows workstations
- Additional binaries compiled with Intel compiler
- Binaries have full symbol information
- Model implemented as extensions to Dyninst instrumentation library
Steps:
1. Strip binary copies and parse to obtain training set
2. Select top idiom features by forward feature selection
3. Perform logistic regression to build idiom model
4. Evaluate model on test data from gap regions in Step 1. Unstripped copies of binaries provide reference set.

Idiom Feature Selection & Training
1. Obtain training data from traditional parse
2. Use Condor HTC to drive forward feature selection on idioms
3. Perform logistic regression on the selected idiom features to obtain model parameters t
(Diagram labels: statically reachable functions; corpus is hundreds of stripped binaries; features Feat1, Feat2, Feat3, ..., Featk)
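Steps 2 and 3 can be illustrated with a compact, pure-Python sketch of greedy forward feature selection wrapped around logistic regression. This is a generic version of the technique under several assumptions: plain batch gradient descent, training log-loss as the selection criterion (a held-out set would be the more careful choice), and a tiny invented data set. The slides do not say which optimizer, criterion, or stopping rule the authors used.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X: Sequence[Sequence[int]], y: Sequence[int], cols: List[int],
                 steps: int = 500, lr: float = 1.0) -> List[float]:
    """Batch gradient descent; w[0] is the bias, w[1:] align with cols."""
    w = [0.0] * (len(cols) + 1)
    n = len(X)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, label in zip(X, y):
            feats = [1.0] + [row[c] for c in cols]
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, feats))) - label
            for j, fj in enumerate(feats):
                grad[j] += err * fj / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def log_loss(X, y, cols: List[int], w: List[float]) -> float:
    """Mean negative log-likelihood of labels in {0, 1}."""
    total = 0.0
    for row, label in zip(X, y):
        feats = [1.0] + [row[c] for c in cols]
        p = sigmoid(sum(wi * fi for wi, fi in zip(w, feats)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(X)


def forward_select(X, y, k: int) -> List[int]:
    """Greedily add the feature that lowers the loss most, up to k features."""
    selected: List[int] = []
    best = log_loss(X, y, [], fit_logistic(X, y, []))
    while len(selected) < k:
        # Each trial is an independent fit, so trials could run as separate jobs.
        trials = []
        for c in range(len(X[0])):
            if c not in selected:
                cols = selected + [c]
                trials.append((log_loss(X, y, cols, fit_logistic(X, y, cols)), c))
        if not trials:
            break
        loss, c = min(trials)
        if loss >= best:
            break
        selected.append(c)
        best = loss
    return selected


# Invented data: columns are idiom matches (f0, f1, f2); label 1 means entry point.
X = [[1, 0, 1], [1, 1, 0], [1, 0, 0], [0, 1, 1],
     [0, 0, 0], [0, 1, 0], [0, 0, 0], [0, 1, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
print(forward_select(X, y, k=2))
```

Because every candidate feature in a round is evaluated by an independent model fit, the inner loop parallelizes naturally across many machines, which fits the slide's use of a high-throughput computing system to drive the selection.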

Evaluation Data Sets
- GNU C Compiler: simple, regular function preamble
- MS Visual Studio: high variation in function entry points
- Intel C Compiler: most variation in entry points; highly optimized

Compiler   Programs examined   Total training examples (pos+neg)   Total test examples (pos+neg)   Actual number of functions in gaps
GCC        625                 8,412,711                           22,806,449                      85,870
MS VS      443                 8,020,828                           11,231,721                      70,620
ICC        112                 1,364,598                           13,169,487                      47,841

Preliminary Results
Comparison of three binary analysis tools:
- Original Dyninst: scans for common entry preamble
- IDA Pro Disassembler: scans for common entry preamble; list of library fingerprints (Windows)
- Dyninst w/ Model: model replaces entry preamble heuristic

Compiler   Orig. Dyninst        IDA Pro              Dyninst w/ Model
           FP        FN         FP        FN         FP        FN
GCC        2,833     2,012      14,576    38,074     403       1,860
MS VS      79,320    65,586     9,044     21,491     725       14,143
ICC        3,786     40,195     14,422    26,970     2,337     16,220
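One way to read these counts is as precision and recall. The sketch below assumes two things the slides do not state outright: that each tool has its own FP and FN column pair, in the order the tools are listed, and that the counts are measured against the "actual number of functions in gaps" from the Evaluation Data Sets slide, so that true positives equal that number minus FN. If either assumption is wrong, the derived values are wrong too.

```python
from typing import Dict, Tuple

# Actual number of functions in gaps, from the Evaluation Data Sets slide.
ACTUAL: Dict[str, int] = {"GCC": 85870, "MS VS": 70620, "ICC": 47841}

# (FP, FN) per tool and compiler, from the Preliminary Results slide.
RESULTS: Dict[str, Dict[str, Tuple[int, int]]] = {
    "Orig. Dyninst": {"GCC": (2833, 2012), "MS VS": (79320, 65586), "ICC": (3786, 40195)},
    "IDA Pro": {"GCC": (14576, 38074), "MS VS": (9044, 21491), "ICC": (14422, 26970)},
    "Dyninst w/ Model": {"GCC": (403, 1860), "MS VS": (725, 14143), "ICC": (2337, 16220)},
}


def precision_recall(fp: int, fn: int, actual: int) -> Tuple[float, float]:
    """Derive precision and recall, assuming TP = actual - FN."""
    tp = actual - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / actual
    return precision, recall


for tool, per_compiler in RESULTS.items():
    for compiler, (fp, fn) in per_compiler.items():
        p, r = precision_recall(fp, fn, ACTUAL[compiler])
        print(f"{tool:17s} {compiler:6s} precision={p:.3f} recall={r:.3f}")
```

Under those assumptions, the model-based parser on GCC binaries has 84,010 true positives out of 85,870 functions, so both its precision and its recall exceed 0.97.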

Classifier Comparisons
(Figure panels: GCC, MSVS, ICC)
Model-based Dyninst extensions outperform vanilla Dyninst and IDA Pro.

Model Component Contributions
(Figure: ICC test set)
- Structural information improves classifier accuracy
- Conflict resolution contributes the most

So Far We’ve…
- Framed stripped binary parsing as a machine learning problem
- Combined idiom and structural information to consider gap regions as a whole
- Extended Dyninst with classifier of Function Entry Points in gaps
- Obtained significant improvement in parsing stripped binaries over existing tools
- Shown how the HTC approach makes expensive ML techniques tractable for large scale systems

Future Work: Extensions
We’d like precision-recall AUC → 1. How?
- More detailed instruction sequence models (e.g. Hidden Markov Model)
- Additional information sources (e.g. pointer tables)
  - Caveat: this is where IDA Pro often goes wrong
- Code provenance
  - First task: identify source compiler (needed to choose appropriate model)

Future Work: Targets
Malicious code
- Lots of hand-coded assembly
- Usually packed (see Kevin Roundy’s talk)
Obfuscated code
- Obfuscation/deobfuscation arms race
- Signal-based obfuscation is latest salvo
- Cannot trust control flow (e.g. non-returning calls, branch functions, opaque branches)
- Maybe model block-level structural properties?

Backup Slides

Tool Performance Comparison
- Classifier maintains high precision with good recall
- Model performance highly system-dependent
- MS Visual Studio & Intel C Compiler FEPs are highly variable