Presentation is loading. Please wait.

Presentation is loading. Please wait.

String Analysis Tevfik Bultan Verification Lab

Similar presentations


Presentation on theme: "String Analysis Tevfik Bultan Verification Lab"— Presentation transcript:

1 String Analysis Tevfik Bultan Verification Lab
Department of Computer Science University of California, Santa Barbara

2 Web software Web software is becoming increasingly dominant
Web applications are used extensively in many areas: Commerce: online banking, online shopping, … Entertainment: online music & videos, … Interaction: social networks We will rely on web applications more in the future: Health records Google Health, Microsoft HealthVault Controlling and monitoring of national infrastructures: Google Powermeter Web software is also rapidly replacing desktop applications Could computing + software-as-service Google Docs, Google …

3 One Major Road Block Web applications are not trustworthy!
Web applications are notorious for security vulnerabilities Their global accessibility makes them a target for many malicious users As web applications are becoming increasingly dominant and as their use in safety critical areas is increasing their trustworthiness is becoming a critical issue

4 Web applications are not secure
There are many well-known security vulnerabilities that exist in many web applications. Here are some examples: Malicious file execution: where a malicious user causes the server to execute malicious code SQL injection: where a malicious user executes SQL commands on the back-end database by providing specially formatted input Cross site scripting (XSS): causes the attacker to execute a malicious script at a user’s browser These vulnerabilities are typically due to errors in user input validation or lack of user input validation

5 Web Application Vulnerabilities

6 Why are web applications error prone?
Extensive string manipulation: Web applications use extensive string manipulation To construct html pages, to construct database queries in SQL, etc. The user input comes in string form and must be validated and sanitized before it can be used This requires the use of complex string manipulation functions such as string-replace String manipulation is error prone

7 Web Application Vulnerabilities
The top two vulnerabilities of the Open Web Application Security Project (OWASP)’s top ten list in 2007 Cross Site Scripting (XSS) Injection Flaws (such as SQL Injection) The top two vulnerabilities of the OWASPs top ten list in 2010

8 String Related Vulnerabilities
String related web application vulnerabilities occur when: a sensitive function is passed a malicious string input from the user This input contains an attack It is not properly sanitized before it reaches the sensitive function String analysis: Discover these vulnerabilities automatically

9 XSS Vulnerability A PHP Example: <script ...
2: $www = $_GET[”www”]; 3: $l_otherinfo = ”URL”; 4: echo ”<td>” . $l_otherinfo . ”: ” . $www . ”</td>”; 5:?> The echo statement in line 4 is a sensitive function It contains a Cross Site Scripting (XSS) vulnerability <script ...

10 Is it Vulnerable? A simple taint analysis, e.g., [Huang et al. WWW04], can report this segment vulnerable using taint propagation 1:<?php 2: $www = $_GET[”www”]; 3: $l_otherinfo = ”URL”; 4: echo ”<td>” . $l_otherinfo . ”: ” .$www. ”</td>”; 5:?> echo is tainted → script is vulnerable tainted

11 How to Fix it? To fix the vulnerability we added a sanitization routine at line s Taint analysis will assume that $www is untainted and report that the segment is NOT vulnerable 1:<?php 2: $www = $_GET[”www”]; 3: $l_otherinfo = ”URL”; s: $www = ereg_replace(”[^A-Za-z0-9 4: echo ”<td>” . $l_otherinfo . ”: ” .$www. ”</td>”; 5:?> tainted untainted

12 Is It Really Sanitized? <script ...
1:<?php 2: $www = $_GET[”www”]; 3: $l_otherinfo = ”URL”; s: $www = ereg_replace(”[^A-Za-z0-9 4: echo ”<td>” . $l_otherinfo . ”: ” .$www. ”</td>”; 5:?> <script ...

13 Sanitization Routines are Erroneous
The sanitization statement is not correct! ereg_replace(”[^A-Za-z0-9 Removes all characters that are not in { A-Za-z0-9 } denotes all characters between “.” and (including “<” and “>”) should be This example is from a buggy sanitization routine used in MyEasyMarket-4.1 (line 218 in file trans.php)

14 String Analysis String analysis determines all possible values that a string expression can take during any program execution Using string analysis we can identify all possible input values of the sensitive functions Then we can check if inputs of sensitive functions can contain attack strings How can we characterize attack strings? Use regular expressions to specify the attack patterns Attack pattern for XSS: Σ∗<scriptΣ∗

15 String Analysis Σ∗<α∗sα∗cα∗rα∗iα∗pα∗tΣ∗
If string analysis determines that the intersection of the attack pattern and possible inputs of the sensitive function is empty then we can conclude that the program is secure If the intersection is not empty, then we can again use string analysis to generate a vulnerability signature characterizes all malicious inputs Given Σ∗<scriptΣ∗ as an attack pattern: The vulnerability signature for $_GET[”www”] is Σ∗<α∗sα∗cα∗rα∗iα∗pα∗tΣ∗ where α∉ { A-Za-z0-9 }

16 Vulnerabilities can be tricky
Input <!sc+rip!t ...> does not match the attack pattern but it matches the vulnerability signature and it can cause an attack 1:<?php 2: $www = <!sc+rip!t ...>; 3: $l otherinfo = ”URL”; s: $www = ereg replace(”[^A-Za-z0-9 <!sc+rip!t...>); 4: echo ”<td>” . $l otherinfo . ”: ” . <script ...> .”</td>”; 5:?>

17 Automata-based String Analysis
Finite State Automata can be used to characterize sets of string values We use automata based string analysis Associate each string expression in the program with an automaton The automaton accepts an over approximation of all possible values that the string expression can take during program execution Using this automata representation we symbolically execute the program, only paying attention to string manipulation operations

18 String Analysis Stages
Convert PHP programs to dependency graphs Combine symbolic forward and backward symbolic reachability analyses Forward analysis Assume that the user input can be any string Propagate this information on the dependency graph When a sensitive function is reached, intersect with attack pattern Backward analysis If the intersection is not empty, propagate the result backwards to identify which inputs can cause an attack Front End Forward Analysis Backward Analysis PHP Program Vulnerability Signatures Attack patterns

19 Dependency Graphs Given a PHP program, first construct the:
2: $www = $ GET[”www”]; 3: $l_otherinfo = ”URL”; 4: $www = ereg_replace( ”[^A-Za-z0-9 ); 5: echo $l_otherinfo . ”: ” .$www; 6:?> $_GET[www], 2 “URL”, 3 [^A-Za-z0-9 4 “”, 4 $www, 2 $l_otherinfo, 3 “: “, 5 preg_replace, 4 str_concat, 5 $www, 4 str_concat, 5 echo, 5 Dependency Graph

20 Forward Analysis Using the dependency graph we conduct vulnerability analysis Automata-based forward symbolic analysis that identifies the possible values of each node Each node in the dependency graph is associated with a DFA DFA accepts an over-approximation of the strings values that the string expression represented by that node can take at runtime The DFAs for the input nodes accept Σ∗ Intersecting the DFA for the sink nodes with the DFA for the attack pattern identifies the vulnerabilities Uses post-image computations of string operations: postConcat(M1, M2) returns M, where M=M1.M2 postReplace(M1, M2, M3) returns M, where M=replace(M1, M2, M3)

21 L(URL: [A-Za-z0-9 .-;=-@/]*<[A-Za-z0-9 .-@/]*)
Forward Analysis Forward = Σ* Attack Pattern = Σ*<Σ* $_GET[www], 2 [^A-Za-z0-9 4 “”, 4 “URL”, 3 $www, 2 Forward = URL Forward = [^A-Za-z0-9 Forward = ε Forward = Σ* “: “, 5 preg_replace, 4 $l_otherinfo, 3 Forward = : Forward = URL Forward = [A-Za-z0-9 str_concat, 5 $www, 4 Forward = URL: Forward = [A-Za-z0-9 str_concat, 5 Forward = URL: [A-Za-z0-9 echo, 5 L(Σ*<Σ*) Forward = URL: [A-Za-z0-9 L(URL: [A-Za-z0-9 = L(URL: [A-Za-z0-9 ≠ Ø

22 Symbolic Automata Representation
We used the MONA DFA Package for automata manipulation [Klarlund and Møller, 2001] Compact Representation: Canonical form and Shared BDD nodes Efficient MBDD Manipulations: Union, Intersection, and Emptiness Checking Projection and Minimization Cannot Handle Nondeterminism: We used dummy bits to encode nondeterminism

23 Intersection Result Automaton
: [A-Za-z0-9 [A-Za-z0-9 Space < URL: [A-Za-z0-9

24 Widening String verification problem is undecidable
The forward fixpoint computation is not guaranteed to converge in the presence of loops and recursion We compute a sound approximation During fixpoint we compute an over approximation of the least fixpoint that corresponds to the reachable states We use an automata based widening operation to over-approximate the fixpoint Widening operation over-approximates the union operations and accelerates the convergence of the fixpoint computation

25 Backward Analysis A vulnerability signature is a characterization of all malicious inputs that can be used to generate attack strings We identify vulnerability signatures using an automata-based backward symbolic analysis starting from the sink node Uses pre-image computations on string operations: preConcatPrefix(M, M2) returns M1 and where M = M1.M2 preConcatSuffix(M, M1) returns M2, where M = M1.M2 preReplace(M, M2, M3) returns M1, where M=replace(M1, M2, M3)

26 Vulnerability Signature = [^<]*<Σ*
Backward Analysis Forward = Σ* Backward = [^<]*<Σ* $_GET[www], 2 node 3 node 6 “URL”, 3 [^A-Za-z0-9 4 “”, 4 $www, 2 Forward = URL Forward = [^A-Za-z0-9 Forward = ε Forward = Σ* Backward = Do not care Backward = Do not care Backward = Do not care Backward = [^<]*<Σ* Vulnerability Signature = [^<]*<Σ* “: “, 5 preg_replace, 4 $l_otherinfo, 3 Forward = : Forward = [A-Za-z0-9 Forward = URL Backward = Do not care Backward = [A-Za-z0-9 Backward = Do not care node 10 str_concat, 5 $www, 4 Forward = URL: Forward = [A-Za-z0-9 node 11 Backward = Do not care Backward = [A-Za-z0-9 str_concat, 5 Forward = URL: [A-Za-z0-9 Backward = URL: [A-Za-z0-9 node 12 echo, 5 Forward = URL: [A-Za-z0-9 Backward = URL: [A-Za-z0-9

27 Vulnerability Signature Automaton
Σ [^<] < Special internal chars [^<]*<Σ*

28 Vulnerability Signatures
The vulnerability signature is the result of the input node, which includes all possible malicious inputs An input that does not match this signature cannot exploit the vulnerability After generating the vulnerability signature Can we generate a patch based on the vulnerability signature? The vulnerability signature automaton for the running example < Σ [^<]

29 Patches from Vulnerability Signatures
Main idea: Given a vulnerability signature automaton, find a cut that separates initial and accepting states Remove the characters in the cut from the user input to sanitize This means, that if we just delete “<“ from the user input, then the vulnerability can be removed < Σ [^<] min-cut is {<}

30 Patches from Vulnerability Signatures
Ideally, we want to modify the input (as little as possible) so that it does not match the vulnerability signature Given a DFA, an alphabet cut is a set of characters that after ”removing” the edges that are associated with the characters in the set, the modified DFA does not accept any non-empty string Finding a minimal alphabet cut of a DFA is an NP-hard problem (one can reduce the vertex cover problem to this problem) We use a min-cut algorithm instead The set of characters that are associated with the edges of the min cut is an alphabet cut but not necessarily the minimum alphabet cut

31 Automatically Generated Patch
Automatically generated patch will make sure that no string that matches the attack pattern reaches the sensitive function <?php if (preg match(’/[^ <]*<.*/’,$ GET[”www”])) $ GET[”www”] = preg replace(<,””,$ GET[”www”]); $www = $_GET[”www”]; $l_otherinfo = ”URL”; $www = ereg_replace(”[^A-Za-z0-9 echo ”<td>” . $l_otherinfo . ”: ” .$www. ”</td>”; ?>

32 Experiments Σ∗<scriptΣ∗
We evaluated our approach on five vulnerabilities from three open source web applications: MyEasyMarket-4.1: A shopping cart program (2) BloggIT-1.0: A blog engine (3) proManager-0.72: A project management system We used the following XSS attack pattern: Σ∗<scriptΣ∗

33 Forward Analysis Results
The dependency graphs of these benchmarks are simplified based on the sinks Unrelated parts are removed using slicing Input Results #nodes #edges #sinks #inputs Time(s) Mem (kb) #states/#bdds 21 20 1 0.08 2599 23/219 29 0.53 13633 48/495 25 2 0.12 1955 125/1200 23 22 4022 133/1222 3387

34 Backward Analysis Results
We use the backward analysis to generate the vulnerability signatures Backward analysis starts from the vulnerable sinks identified during forward analysis Input Results #nodes #edges #sinks #inputs Time(s) Mem (kb) #states/#bdds 21 20 1 0.46 2963 9/199 29 41.03 811/8389 25 2 2.35 5673 20/302, 20/302 23 22 2.33 32035 91/1127 5.02 14958

35 Alphabet Cuts We generate alphabet cuts from the vulnerability signatures using a min-cut algorithm Problem: When there are two user inputs the patch will block everything and delete everything Overlooks the relations among input variables (e.g., the concatenation of two inputs contains < SCRIPT) Input Results #nodes #edges #sinks #inputs Alphabet Cut 21 20 1 {<} 29 {S,’,”} 25 2 Σ , Σ 23 22 {<,’,”} Vulnerability signature depends on two inputs

36 Relational String Analysis
Put multiple single-track DFAs to one multi-track DFA Each track represents the values of one string variable Using multi-track DFAs: Identifies the relations among string variables Generates relational vulnerability signatures for multiple user inputs of a vulnerable application Improves the precision of the path-sensitive analysis Proves properties that depend on relations among string variables, e.g., $file = $usr.txt

37 Multi-track Automata Let X (the first track), Y (the second track), be two string variables λ is a padding symbol A multi-track automaton that encodes X = Y.txt (λ,t) (λ,x) (λ,t) (a,a), (b,b) …

38 Relational Vulnerability Signature
Performs forward analysis using multi-track automata to generate relational vulnerability signatures Each track represents one user input An auxiliary track represents the values of the current node Intersects the auxiliary track with the attack pattern upon termination

39 Relational Vulnerability Signature
Consider a simple example having multiple user inputs <?php 1: $www = $_GET[”www”]; 2: $url = $_GET[”url”]; 3: echo $url. $www; ?> Let the attack pattern be (Σ - <)∗ < Σ∗

40 Relational Vulnerability Signature
A multi-track automaton: ($url, $www, aux) Identifies the fact that the concatenation of two inputs contains < (a,λ,a), (b,λ,b), (a,λ,a), (b,λ,b), (<,λ,<) (λ,a,a), (λ,b,b), (λ,<,<) (λ,a,a), (λ,b,b), (λ,<,<) (λ,a,a), (λ,b,b), (λ,a,a), (λ,b,b),

41 Relational Vulnerability Signature
Project away the auxiliary variable Find the min-cut This min-cut identifies the alphabet cuts {<} for the first track ($url) and {<} for the second track ($www) (a,λ), (b,λ), (a,λ), (b,λ), (<,λ) (λ,a), (λ,b), (λ,<) (λ,a), (λ,b), (λ,<) (λ,a), (λ,b), (λ,a), (λ,b), min-cut is {<},{<}

42 Patch for Multiple Inputs
Patch: If the inputs match the signature, delete its alphabet cut <?php if (preg match(’/[^ <]*<.*/’, $ GET[”url”].$ GET[”www”])) { $ GET[”url”] = preg replace(<,””,$ GET[”url”]); $ GET[”www”] = preg replace(<,””,$ GET[”www”]); } 1: $www = $ GET[”www”]; 2: $url = $ GET[”url”]; 3: echo $url. $www; ?>

43 Technical Issues To conduct relational string analysis, we need to compute ”intersection” of multi-track automata Intersection is closed under aligned multi-track automata λs are right justified in all tracks, e.g., abλλ instead of aλbλ However, there exist unaligned multi-track automata that are not describable by aligned ones We propose an alignment algorithm that constructs aligned automata which over or under approximate unaligned ones

44 Other Technical Issues
Modeling Word Equations: Intractability of X = cZ: The number of states of the corresponding aligned multi-track DFA is exponential to the length of c. Irregularity of X = YZ: X = YZ is not describable by an aligned multi-track automata We have proven the above results We proposed a conservative analysis Constructs multi-track automata that over or under-approximate the word equations

45 Composite Analysis What I have talked about so far focuses only on string contents It does not handle constraints on string lengths It cannot handle comparisons among integer variables and string lengths We extended our string analysis techniques to analyze systems that have unbounded string and integer variables We proposed a composite static analysis approach that combines string analysis and size analysis

46 Size Analysis Size Analysis: The goal of size analysis is to provide properties about string lengths It can be used to discover buffer overflow vulnerabilities Integer Analysis: At each program point, statically compute the possible states of the values of all integer variables. These infinite states are symbolically over-approximated as linear arithmetic constraints that can be represented as an arithmetic automaton Integer analysis can be used to perform size analysis by representing lengths of string variables as integer variables.

47 An Example Consider the following segment:
1: <?php 2: $www = $ GET[”www”]; 3: $l otherinfo = ”URL”; 4: $www = ereg replace(”[^A-Za-z0-9 5: if(strlen($www) < $limit) 6: echo ”<td>” . $l otherinfo . ”: ” . $www . ”</td>”; 7:?> If we perform size analysis solely, after line 4, we do not know the length of $www If we perform string analysis solely, at line 5, we cannot check/enforce the branch condition.

48 Composite Analysis We need a composite analysis that combines string analysis with size analysis. Challenge: How to transfer information between string automata and arithmetic automata? A string automaton is a single-track DFA that accepts a regular language, whose length forms a semi-linear set For example: {4, 6} ∪ {2 + 3k | k ≥ 0} The unary encoding of a semi-linear set is uniquely identified by a unary automaton The unary automaton can be constructed by replacing the alphabet of a string automaton with a unary alphabet

49 Arithmetic Automata An arithmetic automaton is a multi-track DFA, where each track represents the value of one variable over a binary alphabet If the language of an arithmetic automaton satisfies a Presburger formula, the value of each variable forms a semi-linear set The semi-linear set is accepted by the binary automaton that projects away all other tracks from the arithmetic automaton

50 Connecting the Dots String Automata Unary Length Automata
We developed novel algorithms to convert unary automata to binary automata and vice versa Using these conversion algorithms we can conduct a composite analysis that subsumes size analysis and string analysis String Automata Unary Length Automata Binary Length Automata Arithmetic Automata

51 Stranger: A String Analysis Tool
Stranger is available at: Uses Pixy [Jovanovic et al., 2006] as a PHP front end Uses MONA [Klarlund and Møller, 2001] automata package for automata manipulation Attack patterns Pixy Front End Symbolic String Analysis String/Automata Operations Parser Automata Based String Manipulation Library String Analyzer Dependency Graphs Stranger Automata PHP program CFG DFAs Dependency Analyzer String Analysis Report (Vulnerability Signatures) MONA Automata Package

52 Case Study Schoolmate 1.5.4 Number of PHP files: 63
Lines of code: 8181 Forward Analysis results After manual inspection we found the following: Time Memory Number of XSS sensitive sinks Number of XSS Vulnerabilities 22 minutes 281 MB 898 153 Actual Vulnerabilities False Positives 105 48

53 Case Study – False Positives
Why false positives? Path insensitivity: 39 Path to vulnerable program point is not feasible Un-modeled built in PHP functions : 6 Unfound user written functions: 3 PHP programs have more than one execution entry point We can remove all these false positives by extending our analysis to a path sensitive analysis and modeling more PHP functions

54 Case Study - Sanitization
We patched all actual vulnerabilities by adding sanitization routines We ran stranger the second time Stranger proved that our patches are correct with respect to the attack pattern we are using

55 Related Work: String Analysis
String analysis based on context free grammars: [Christensen et al., SAS’03] [Minamide, WWW’05] String analysis based on symbolic/concolic execution: [Bjorner et al., TACAS’09] Bounded string analysis : [Kiezun et al., ISSTA’09] Automata based string analysis: [Xiang et al., COMPSAC’07] [Shannon et al., MUTATION’07] Application of string analysis to web applications: [Wassermann and Su, PLDI’07, ICSE’08] [Halfond and Orso, ASE’05, ICSE’06]

56 Related Work Size Analysis
Size analysis: [Hughes et al., POPL’96] [Chin et al., ICSE’05] [Yu et al., FSE’07] [Yang et al., CAV’08] Composite analysis: [Bultan et al., TOSEM’00] [Xu et al., ISSTA’08] [Gulwani et al., POPL’08] [Halbwachs et al., PLDI’08] Vulnerability Signature Generation Test input/Attack generation: [Wassermann et al., ISSTA’08] [Kiezun et al., ICSE’09] Vulnerability signature generation: [Brumley et al., S&P’06] [Brumley et al., CSF’07] [Costa et al., SOSP’07]

57 VLab String Analysis Publications
Publications by Fang Yu and Muath Alkhalaf Stranger: An Automata-based String Analysis Tool for PHP [TACAS’10] Generating Vulnerability Signatures for String Manipulating Programs Using Automata-based Forward and Backward Symbolic Analyses [ASE’09] Symbolic String Verification: Combining String Analysis and Size Analysis [TACAS’09] Symbolic String Verification: An Automata-based Approach [SPIN’08]

58 Web applications are error prone
Most web applications have navigation errors where an unexpected user request can cause a web application to display cryptic error messages display sensitive information that might be exploited by malicious users execute an unintended action

59 Why are web applications error prone?
Script-oriented programming: A web application consists of a collections of scripts These scripts call each other indirectly through interaction by the user and the browser The form that one script generates has the address of the next script that will consume the user input There are no systematic checks that guarantee that the caller and the callee agree on an interface For example in a procedure call, the caller and the callee must agree on the number of arguments and their types There is no explicit control flow identifying the execution order The control flow is buried in the links of the generated html pages

60 Why are web applications error prone?
Interactivity User interaction is not under the control of the developer The back button of the browser The user can open multiple windows The user can cut and paste the URL Statefull interaction over stateless protocols (HTTP) Interactions between different software components browser, server, back-end database the need to maintain session state across these components One web application can be composed of many applications Mash-ups, web services

61 Navigation errors: Bamboo Invoice

62 Navigation errors: Bamboo Invoice

63 Navigation errors: Digitalus

64 Navigation errors: Digitalus

65 Navigation errors: Digitalus

66 Request processing in a Web application
Request processing in Web applications that use MVC frameworks

67 Navigation modeling and analysis
We developed a simple language to specify navigation state machines It is a state machine that shows the allowable sequences of controller action executions in a web application MVC frameworks typically use a hierarchical structure where actions are combined in the controllers and controllers are grouped into modules We exploit this hierarch to specify the navigation state machines as hierarchical state machines

68 Navigation state machines
The states of a navigation state machine is defined by the values of the session variables, the last action executed by the application And the request parameters that were sent with the last action We assume that this information is enough to figure out what are the next actions that can be executed by the application

69 What can we do with NSMs If we can check that the web application conforms to the NSM, then we can verify navigation properties on the NSM and conclude that the navigation properties hold for the application We can also use automated verification techniques to check properties of NSMs This way we can eliminate the navigation errors Big problem: How do we ensure that the application conforms to the NSM? Two approaches Automatically extract the NSM from the application Use runtime enforcement to make sure that the application follows the NSM Or use a combination of these two

70 Runtime Enforcement Statically verifying that a web application conforms to a navigation state machine is a very difficult problem (in general undecidable) So, instead, we use runtime enforcement We have a plugin that can be easily added to an MVC web application that takes a NSM as input and makes sure that every incoming request conforms to the NSM If the incoming request does not obey the NSM, then the plugin either ignores the request and refreshes the previous page or generates an appropriate error message This way non-compliant user requests can be handled uniformly without generating strange error messages

71 THE END

72 Modular Verification of Web Services
Client and server side verification for web services

73 Interface Grammars: Client vs. Server
Interface grammars: A formalism for interface specication Interface grammars can be used for Client verification: Generate a stub for the server Server verification: Generate a driver for the server Parser Client Stub Interface Compiler Interface Server Driver Sentence Generator

74 Interface Grammars for Web Services


Download ppt "String Analysis Tevfik Bultan Verification Lab"

Similar presentations


Ads by Google