Discovering Similarity of Short Programs by Canonical Form Baohua Wu University of Pennsylvania.



Scenario
Given a known malicious program P1 that exploits a security hole and an unknown suspicious program P2, how can we measure the similarity of P2 to P1?
Given known polymorphic malicious programs P1, P2, …, Pn, how can we identify their common “fingerprints”?

Assumption
Malicious programs are short, for example:
  – Scripts < 500 lines
  – Assembly code < 10 kilobytes

Obfuscation Techniques
Dead-Code Insertion
  – NOP, CLI, STI, etc.
  – More complicated ones: inc/dec, push/pop
Code Transposition
  – Add (unconditional) branches
  – Reorder independent instructions
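
As a rough illustration of how inserted dead code might be stripped again, the sketch below removes `nop` instructions and adjacent cancelling pairs (inc/dec, push/pop on the same operand). The tuple encoding of instructions and the name `strip_junk` are assumptions made for this example. It ignores side effects such as flag changes, and it deliberately leaves CLI/STI alone because they change the interrupt flag.

```python
# Hypothetical instruction encoding: (opcode, operand, ...) tuples.
NO_OPS = {"nop"}
# Adjacent opcode pairs that cancel out when applied to the same operand.
CANCELLING = {("inc", "dec"), ("dec", "inc"), ("push", "pop")}


def strip_junk(instructions):
    """Drop no-ops and adjacent cancelling pairs (ignores flag side effects)."""
    out = []
    for instr in instructions:
        op, *args = instr
        if op in NO_OPS:
            continue
        if out:
            prev_op, *prev_args = out[-1]
            if (prev_op, op) in CANCELLING and prev_args == args:
                out.pop()  # the pair has no net effect, so remove both
                continue
        out.append(instr)
    return out


obfuscated = [
    ("mov", "eax", "1"),
    ("nop",),
    ("inc", "ebx"),
    ("dec", "ebx"),
    ("push", "ecx"),
    ("pop", "ecx"),
    ("add", "eax", "2"),
]
cleaned = strip_junk(obfuscated)
print(cleaned)
```

Using `out` as a stack means nested pairs (for example push/inc/dec/pop) also collapse, since removing the inner pair makes the outer pair adjacent.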

Obfuscation Techniques
Register Reassignment
  – Replace eax with ebx if ebx is unused in a live range
  – Prologue/epilogue code to swap registers
Instruction Substitution
  – The IA-32 instruction set has many equivalent instructions

Obfuscation Techniques
Data Modification
  – Replace a boolean variable with two integers: X → a < b
Encryption
  – Polymorphic engine
  – Variable keys, algorithms, decryptors

Obfuscation Summary
Changing instructions inside a basic block
Changing control flow
Dynamic code generation
How can we handle them?

Objective of Canonical Form of Programs
Reducing polymorphism
Identifying tokens for statistical analysis

Canonical Form of Programs
Compact intermediate instructions
  – No or few alternative instructions
Simplified programming model
  – Code segment: read only
  – Data segment: heap only (no stack, no registers)
  – No function calls except system calls
  – Conditional and loop instructions are kept

More about Canonical Form
Encrypted code is processed in advance
  – Multiple phases of compilation
  – Or simply report it as suspicious
No user-defined function calls
  – Recursive function elimination
  – Inline function expansion
Code optimization by compiler techniques
  – No dead or useless code
  – No or few redundant common subexpressions

More about Canonical Form
For assembly programs, treat registers as variables
  – No limit on the number of registers
  – No unnecessary swapping instructions
Rename variables in some total order (v1, v2, …)
  – Definition position in the program is a total order, but it may be changed by polymorphism
  – Main order by data dependency
  – Secondary order by variable type, length, name, definition position
Reorder interchangeable instructions in alphabetical order
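
The renaming step can be sketched as follows. This is a simplification, not the ordering proposed on the slide: it numbers variables by first appearance, which is exactly the position-based order that polymorphism may disturb, and it skips the data-dependency ordering. The tuple encoding and the name `rename_variables` are assumptions for this example; string operands are treated as variables (registers included) and integers as constants.

```python
def rename_variables(program):
    """Rename variables to v1, v2, ... in order of first appearance.

    program: list of (opcode, operand, ...) tuples (hypothetical encoding).
    String operands are variables or registers; ints are constants.
    """
    mapping = {}
    renamed = []
    for opcode, *operands in program:
        new_operands = []
        for operand in operands:
            if isinstance(operand, str):
                if operand not in mapping:
                    mapping[operand] = f"v{len(mapping) + 1}"
                new_operands.append(mapping[operand])
            else:
                new_operands.append(operand)
        renamed.append((opcode, *new_operands))
    return renamed


# Two variants that differ only in register assignment.
variant_a = [("mov", "eax", 5), ("add", "eax", "ecx")]
variant_b = [("mov", "ebx", 5), ("add", "ebx", "edx")]
print(rename_variables(variant_a))
print(rename_variables(variant_b))
```

Both variants map to the same renamed sequence, which is the effect the canonical form relies on to neutralize register reassignment.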

What else for polymorphism?
Changes in algorithm
  – Not in my scope…
Changes in control flow
  – Unconditional branch insertion
  – Combination of conditional branches
  – Exchanging internal and external loops
  – Useless branches

Unconditional branch insertion
Original:
  A; B; C;
Obfuscated:
  goto 3;
  1: C; goto 4;
  2: B; goto 1;
  3: A; goto 2;
  4:
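
One way to undo this transformation is to follow the unconditional jumps and emit statements in the order they would execute. The sketch below does that for the example on this slide; the `(kind, argument)` encoding and the name `straighten` are assumptions for illustration, and the code only handles unconditional gotos.

```python
def straighten(program):
    """Follow unconditional gotos and return statements in execution order.

    program: list of ("goto", label), ("label", label) or ("stmt", name)
    tuples (hypothetical encoding).
    """
    labels = {arg: i for i, (kind, arg) in enumerate(program) if kind == "label"}
    ordered, pc, visited = [], 0, set()
    while pc < len(program):
        if pc in visited:
            raise ValueError("cycle found: not a pure chain of gotos")
        visited.add(pc)
        kind, arg = program[pc]
        if kind == "goto":
            pc = labels[arg]  # jump to the position of the target label
            continue
        if kind == "stmt":
            ordered.append(arg)
        pc += 1
    return ordered


obfuscated = [
    ("goto", 3),
    ("label", 1), ("stmt", "C"), ("goto", 4),
    ("label", 2), ("stmt", "B"), ("goto", 1),
    ("label", 3), ("stmt", "A"), ("goto", 2),
    ("label", 4),
]
print(straighten(obfuscated))
```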

Combination of conditional branches
Original:
  If a < b Then A; Else B;
  If c < d Then C; Else D;
Obfuscated:
  If a < b and c < d Then A; C;
  Else if a < b and c >= d Then A; D;
  Else if a >= b and c < d Then B; C;
  Else B; D;
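
That the separate and the combined forms run the same statements can be checked by enumerating the four outcomes of the two tests. In this sketch the booleans `a_lt_b` and `c_lt_d` stand for the results of `a < b` and `c < d`; the function names are made up for the example.

```python
from itertools import product


def separate_branches(a_lt_b, c_lt_d):
    """Two independent if/else statements executed one after the other."""
    trace = []
    trace.append("A" if a_lt_b else "B")
    trace.append("C" if c_lt_d else "D")
    return trace


def combined_branches(a_lt_b, c_lt_d):
    """A single if/else-if chain over both conditions."""
    if a_lt_b and c_lt_d:
        return ["A", "C"]
    elif a_lt_b and not c_lt_d:
        return ["A", "D"]
    elif not a_lt_b and c_lt_d:
        return ["B", "C"]
    else:
        return ["B", "D"]


# Compare both forms on every combination of condition outcomes.
equivalent = all(
    separate_branches(x, y) == combined_branches(x, y)
    for x, y in product([False, True], repeat=2)
)
print(equivalent)
```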

Exchanging internal and external loops
Original:
  Sum(matrix a)
    For (i=0;i<10;i++)
      For (j=0;j<10;j++)
        sum += a[i][j];
Obfuscated:
  Sum(matrix a)
    For (j=0;j<10;j++)
      For (i=0;i<10;i++)
        sum += a[i][j];
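
The two loop orders compute the same sum, which a small runnable version makes easy to confirm. The function names and the test matrix are assumptions for this example; integers are used so that the order of additions cannot affect the result.

```python
def sum_rows_first(a):
    """Outer loop over rows (i), inner loop over columns (j)."""
    total = 0
    for i in range(len(a)):
        for j in range(len(a[0])):
            total += a[i][j]
    return total


def sum_columns_first(a):
    """Outer loop over columns (j), inner loop over rows (i)."""
    total = 0
    for j in range(len(a[0])):
        for i in range(len(a)):
            total += a[i][j]
    return total


# 10 x 10 matrix holding the integers 0..99.
matrix = [[i * 10 + j for j in range(10)] for i in range(10)]
print(sum_rows_first(matrix), sum_columns_first(matrix))
```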

Useless branches
Original:
  A;
  B;
  C;
  .
  End: D;
Obfuscated:
  A;
  If date<1900 Goto End;
  B;
  C;
  .
  End: D;

Linearizing Control Flow
So far, no semantics has been lost. From here on, that changes!
Remove backward branches
  – Replace them (such as a loop) with repeated conditional statements
  – The number of repetitions is set to N (e.g., 2)
Remove forward branches by enumerating the possible combinations of executed branches
Further change each path into canonical form
CPS: Canonical Path Set
  – A Critical Canonical Path in a CPS is a sub-path of an actual execution path that causes damage
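
A minimal sketch of this linearization, assuming the program is available as a small nested structure: loops are unrolled into at most N nested conditionals, and every combination of branch outcomes becomes one linear path. The node encoding, the name `linear_paths` and the `assume ...` markers that record which branch was taken are all assumptions for this example; dropping the markers gives paths with no control-flow information at all.

```python
def linear_paths(node, n=2):
    """Enumerate linear paths through a structured program.

    Node forms (hypothetical encoding):
      ("stmt", name)
      ("seq", [node, ...])
      ("if", condition, then_node, else_node)
      ("loop", condition, body_node)   # unrolled at most n times
    """
    kind = node[0]
    if kind == "stmt":
        return [[node[1]]]
    if kind == "seq":
        paths = [[]]
        for child in node[1]:
            paths = [p + q for p in paths for q in linear_paths(child, n)]
        return paths
    if kind == "if":
        _, cond, then_node, else_node = node
        taken = [[f"assume {cond}"] + p for p in linear_paths(then_node, n)]
        skipped = [[f"assume not {cond}"] + p for p in linear_paths(else_node, n)]
        return taken + skipped
    if kind == "loop":
        _, cond, body = node
        # Replace the backward branch with n nested conditionals.
        unrolled = ("seq", [])
        for _ in range(n):
            unrolled = ("if", cond, ("seq", [body, unrolled]), ("seq", []))
        return linear_paths(unrolled, n)
    raise ValueError(f"unknown node kind: {kind}")


program = ("seq", [
    ("stmt", "A"),
    ("loop", "i<10", ("stmt", "B")),
    ("if", "x", ("stmt", "C"), ("stmt", "D")),
])
paths = linear_paths(program, n=2)
for path in paths:
    print(path)
```

With N = 2 the loop contributes three alternatives (zero, one or two iterations), so the example yields six paths. The path count grows multiplicatively with each branch, which is why the short-program assumption matters.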

Similarity of Canonical Programs
P1 is a known malicious program
P2 is an unknown program
Similarity(P1, P2) =

PathSim: Similarity of Canonical Paths
Recall that canonical paths have
  – Linear execution
  – No control flow
  – No redundant common subexpressions
  – No useless code
  – No dead code
  – No registers
  – Variables renamed by some total order
  – Independent instructions sorted in alphabetical order
Similarity algorithms for text documents can be used
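
Because a canonical path is just a sequence of instructions, an off-the-shelf sequence matcher can serve as a stand-in for PathSim. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; choosing this particular measure is an assumption, since the slides do not name a specific text-similarity algorithm. The name `path_sim` and the sample paths are made up for the example.

```python
from difflib import SequenceMatcher


def path_sim(path_a, path_b):
    """Similarity in [0, 1] between two canonical paths.

    Each path is a list of instruction strings. The score is
    2 * matches / (len(path_a) + len(path_b)), as computed by difflib.
    """
    return SequenceMatcher(None, path_a, path_b, autojunk=False).ratio()


known = ["mov v1 5", "add v1 v2", "store v1"]
suspect = ["mov v1 5", "add v1 v2", "xor v3 v3", "store v1"]
print(path_sim(known, known))
print(path_sim(known, suspect))
```

Here three of the seven instructions across both paths match in order on each side, so the second score is 6/7.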

Identifying Critical Canonical Path (CCP)
P1, P2, P3, …, Pn are known malicious programs
A CCP must have at least one similar path in every Canonical Path Set CPS(P1), CPS(P2), …, CPS(Pn)
Statistical algorithms can be applied, e.g., the Gibbs sampler
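
The constraint that a CCP needs a similar path in every Canonical Path Set can be illustrated with a brute-force filter. This is not the Gibbs sampler mentioned on the slide; it is a much simpler threshold check swapped in for illustration. The names `candidate_ccps` and `path_sim`, the similarity measure and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher


def path_sim(path_a, path_b):
    """Sequence similarity in [0, 1] between two canonical paths."""
    return SequenceMatcher(None, path_a, path_b, autojunk=False).ratio()


def candidate_ccps(cps_list, threshold=0.8):
    """Return paths of the first CPS that have a similar path in every other CPS.

    cps_list: one Canonical Path Set (a list of paths) per known malicious
    program. The threshold value is an arbitrary choice for this sketch.
    """
    first, *others = cps_list
    return [
        path
        for path in first
        if all(any(path_sim(path, q) >= threshold for q in cps) for cps in others)
    ]


cps_p1 = [["a", "b", "c"], ["x", "y"]]
cps_p2 = [["a", "b", "c", "d"], ["z"]]
cps_p3 = [["a", "b", "c"], ["q", "r"]]
candidates = candidate_ccps([cps_p1, cps_p2, cps_p3])
print(candidates)
```

The pairwise comparison is quadratic in the number of paths, which stays manageable only because the path sets are limited in size.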

Summary
Assumption: malicious programs are short
Canonical form for comparison
Limited number of canonical linear paths
Similarity problem for text documents
Statistical methods to identify common fingerprints

Acknowledgement Thank You All!