Published by August Stephens. Modified over 9 years ago.
1
STRINGS AND AUTOMATA MODULO THEORIES
Margus Veanes
July 18, 2015, SMT'15, San Francisco
2
MOTIVATION
Symbolic execution
– Path feasibility analysis involving string constraints
– Regular expression matching
Security vulnerabilities
– SQL injection attacks
– XSS attacks
– DoS attacks, e.g. regex injection
– Directory traversal attacks
– …
[OWASP] top 1, 3 culprits
Example URL: http://foo.bar.system/scripts/..%c1%1c../winnt/system32/cmd.exe?/c+dir+c:\
Data processing
– Parallelization
– Deforestation
Malware detection
3
“EARLY” WORK RELATED TO STRING ANALYSIS
Tools
– Mona: Henriksen-Jensen-Jørgensen-Klarlund-Paige-Rauhe-Sandholm, TACAS’95 (built on the BRICS automata library)
– JSA: Christensen-Møller-Schwartzbach, SAS’03 (uses BRICS)
– Haderach: Shannon-Hajra-Lee-Zhan-Khurshid, MUTATION’07 (uses BRICS)
Theory
– Bjørner, PhD Thesis’98 (decision procedure for queues)
– Blumensath-Grädel, LICS’00 (automatic structures)
– Benedikt-Libkin-Schwentick-Segoufin, LICS’01 (regular string relations)
– Khoussainov-Nies-Rubin-Stephan, LICS’04 (automatic Boolean algebras)
– Bala, STACS’04 (regular term matching)
– Kunc, DLT’2007 (complexity of language equations)
4
THE RISE OF THE STRING ANALYZERS
String theory encodings in SMT:
– Pex-LL: Bjørner-Tillmann-Voronkov, TACAS’09 (strings + SMT)
– Reggae: Li-Xie-Tillmann-deHalleux-Schulte, ASE’09 (symbolic exploration of regex code)
– Z3-str: Zheng-Zhang-Ganesh, ESEC/FSE 2013 (plugin to Z3)
– CVC4-str: Liang-Reynolds-Tinelli-Barrett-Deters, CAV’14 (DPLL(T SLRp))
– S3: Trinh-Chu-Jaffar, CCS’14 (uses Z3-str-star)
Automata related:
– Stranger: Yu-Alkhalaf-Bultan-Ibarra-Cova, SPIN’08, TACAS’09, TACAS’10 (automata based)
– DPRLE: Hooimeijer-Weimer, PLDI’09 (subset checking)
– Hampi: Kiezun-Ganesh-Guo-Hooimeijer-Ernst, ISSTA’09, best paper award (reduction to BV)
– Kaluza (in Kudzu): Saxena-Akhawe-Hanna-Mao-McCamant-Song, Oakland’10 (Hampi + multiple variables)
– Rex: Veanes-deHalleux-Tillmann-Bjørner-deMoura, ICST’10, LPAR’2010 (language acceptors)
– Bek: Hooimeijer-Livshits-Molnar-Saxena-Veanes-Bjørner, USENIX Security'11, POPL’12 (transducers)
– Bex: D’Antoni-Veanes, VMCAI’13, CAV’13 (lookahead)
– PASS: Li-Ghosh, HVC 2013, best paper award (array based)
– SMC: Luu-Shinde-Saxena-Demsky, PLDI’14 (model counting)
CAV’15:
– ABC: Aydin-Bang-Bultan (automata based counting, using Stranger and BRICS)
– NORN: Abdulla-Atig-Chen-Holik-Rezine-Rümmer-Stenman, also CAV’14 (Horn clauses, BRICS)
– Z3-str+: Zheng-Ganesh-Subramanian-Tripp-Dolby-Zhang (string + regex + length)
5
TWO QUESTIONS
What are characters?
What are strings?
smileycipher(“hello world”) = “ ” [emoji output not preserved]
Is this a string function?
6
WHAT ARE CHARACTERS?
1. Elements of a finite alphabet Σ?
– The only primitive operation is =: Σ × Σ → Bool
– What about Unicode, e.g. http://unicode.org/charts/PDF/U1F600.pdf, where |Σ| = 1,112,064?
– For succinctness allow a total order ≺: Σ × Σ → Bool and ranges [a-b] (denoting {x | a ≼ x ≼ b})
– This affects the notion of automaton over Σ! Why not other operations as well?
2. Bit-vectors, say char (BV16)?
– With primitive operations like &: char × char → char
– “😀” = “\uD83D\uDE00” (UTF-16 surrogate pair)
– The character sort then has its own theory, namely the bit-vector theory!
3. Integers (code points)?
– 😀 = 0x1F600 = 128512
– e.g. 😀 + 1 = 😁 = 0x1F601
– The character sort then has its own theory, namely the integer theory!
…
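The numbers on this slide can be checked with a few lines of arithmetic. The following is a minimal Python sketch; the helper name `utf16_surrogates` is hypothetical and not part of any tool mentioned in the talk.

```python
def utf16_surrogates(cp):
    """Split a supplementary code point (0x10000..0x10FFFF) into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # top 10 bits
    low = 0xDC00 + (v & 0x3FF)   # bottom 10 bits
    return high, low

GRINNING_FACE = 0x1F600          # code point view: an integer
high, low = utf16_surrogates(GRINNING_FACE)   # bit-vector view: two 16-bit chars

# Number of Unicode scalar values: all code points minus the surrogate range.
SCALAR_VALUES = 0x110000 - (0xE000 - 0xD800)

print(hex(high), hex(low), GRINNING_FACE, SCALAR_VALUES)
```

The same character is one element under the code-point view and a two-element string under the BV16 view, which is why the choice of character theory changes what counts as a string.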
7
WHAT ARE STRINGS?
Finite sequences of characters (char)
– CVC4-str
– Singleton string = char
Restricted arrays of int to char
– Pex-LL, PASS
– array ≠ char, singleton string ≠ char
Finite lists of characters
– Pex-Rex
– list ≠ char, singleton string ≠ char
Finite queues
– transducers
The answer depends on the context and the required operations.
– First, Last, Rest, Append, Substring, Length, …
8
ANALYSIS TASKS
Consider character type C, string type S, and regular expression type R.
– When is DPLL(T_C, T_S, T_R) possible/feasible?
What about (finite state) transducers?
– Regular transformations of type S → S
– Typically T_in = T_out = bit-vectors
– Many string transformations are such: sanitizers, encoders
9
HTML ENCODER
[Code figure not preserved. Callout: arithmetic operations on characters]
10
FOR EACH DOMAIN SPECIFIC TASK
Design a language that
– only has the features required by the task
– is simple to use
– makes it possible to reason automatically about what the programs do
– compiles into efficient code
11
THE REST OF THE TALK
– Symbolic Automata and Transducers
– BEK and string sanitizers
– BEX and string encoders
– Data parallel BEK/BEX for string processing
12
SYMBOLIC FINITE AUTOMATA
13
SYMBOLIC FINITE AUTOMATON (SFA)
Labels are predicates.
One symbolic transition between states p and q, labeled λx. 'a' ≤ x ≤ 'd', denotes many concrete transitions between p and q: one for each x ∈ 〚'a' ≤ x ≤ 'd'〛, i.e. 'a', 'b', 'c', 'd'.
14
SFA EXECUTION EXAMPLE
[Diagram: states p and q; each state has one outgoing transition guarded by λx. x mod 2 = 0 and one guarded by λx. x mod 2 = 1]
Input: 1 2 5 3
State sequence: p p q p p
p is final, so the input is accepted.
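The run above can be replayed with a small Python sketch. The transition table is an assumption reconstructed from the state sequence p p q p p (odd inputs lead to p, even inputs lead to q); guards are modeled as plain Python predicates rather than formulas of an SMT theory.

```python
# Guards as Python predicates over integers (a stand-in for symbolic predicates).
def even(x):
    return x % 2 == 0

def odd(x):
    return x % 2 == 1

# Assumed transition table, reconstructed from the run shown on the slide.
TRANSITIONS = {
    "p": [(odd, "p"), (even, "q")],
    "q": [(odd, "p"), (even, "q")],
}
FINAL = {"p"}

def run(word, start="p"):
    """Return (visited states, accepted?) for a deterministic SFA."""
    state, trace = start, [start]
    for x in word:
        targets = [dst for guard, dst in TRANSITIONS[state] if guard(x)]
        if not targets:              # no enabled transition: reject
            return trace, False
        state = targets[0]           # guards per state are disjoint here
        trace.append(state)
    return trace, state in FINAL

trace, accepted = run([1, 2, 5, 3])
print(trace, accepted)
```

Only two guards per state are stored, yet the automaton reads arbitrary integers; a classical automaton would need one transition per concrete symbol.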
15
SYMBOLIC FINITE AUTOMATA
What is the alphabet?
16
ALPHABET IS AN EFFECTIVE BOOLEAN ALGEBRA
( D, P, 〚_〛, ⊥, ⊤, ∨, ∧, ¬ )
– D: domain
– P: predicates
– 〚_〛: P → 2^D
17
ALPHABET EXAMPLE
2^{a,b} = ( D, P, 〚_〛, ⊥, ⊤, ∨, ∧, ¬ ) with
– D = {a,b}
– P = {∅, {a}, {b}, {a,b}}
– 〚_〛 = id
– ⊥ = ∅, ⊤ = {a,b}
– ∨ = ∪, ∧ = ∩, ¬ = complement (c)
regex: a*b(a|b)*
SFA over 2^{a,b}: [Diagram: states p and q with transition labels {a} and {b}]
18
ALPHABET EXAMPLE: 2^BVk
– D = {n | 0 ≤ n < 2^k}
– P = BDDs of depth k
– Boolean operations are BDD operations
Below, 〚β_i〛 = {n ∈ D | i'th bit of n is 1}, where β_i names the BDD for bit i.
[BDD diagram not preserved]
β_i has fixed size independent of i.
19
ALPHABET EXAMPLE: SMT INT
– D = integers
– P = integer linear arithmetic formulas (with one fixed free variable)
– 〚φ ∧ ψ〛 = 〚φ〛 ∩ 〚ψ〛
– 〚⊥〛 = ∅, 〚¬φ〛 = D \ 〚φ〛
– Satisfiability: 〚φ〛 ≠ ∅
20
BOOLEAN ALGEBRA INTERFACE IN C#
    public interface IBoolAlg<P>
    {
        P Top { get; }
        P Bot { get; }
        P Not(P pred);
        P Or(P pred1, P pred2);
        P And(P pred1, P pred2);
        bool IsSat(P predicate);
    }
    public interface IBoolAlgExt<D, P> : IBoolAlg<P>
    {
        IEnumerable<D> Den(P pred);  // elements denoted by pred
        P One(D elem);               // predicate denoting the single element elem
    }
21
UNIT ALPHABET EXAMPLE IN C#
One-letter alphabet:
    class A1 : IBoolAlg<bool>
    {
        public bool Top { get { return true; } }
        public bool Bot { get { return false; } }
        public bool Not(bool pred) { return !pred; }
        public bool Or(bool pred1, bool pred2) { return pred1 || pred2; }
        public bool And(bool pred1, bool pred2) { return pred1 && pred2; }
        public bool IsSat(bool pred) { return pred; }
    }
22
ANOTHER ALPHABET EXAMPLE IN C#
16-letter alphabet:
    class A16 : IBoolAlg<UInt16>
    {
        public UInt16 Top { get { return 0xFFFF; } }
        public UInt16 Bot { get { return 0; } }
        // C# promotes UInt16 operands to int, so the results are cast back.
        public UInt16 Not(UInt16 pred) { return (UInt16)~pred; }
        public UInt16 Or(UInt16 pred1, UInt16 pred2) { return (UInt16)(pred1 | pred2); }
        public UInt16 And(UInt16 pred1, UInt16 pred2) { return (UInt16)(pred1 & pred2); }
        public bool IsSat(UInt16 pred) { return pred != 0; }
    }
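To make the denotation of such bitmask predicates concrete, here is a Python sketch of the same 16-letter algebra. The reading that bit i stands for letter i, and the method names `den` and `one` (mirroring `Den` and `One` from the extended interface), are assumptions made for illustration.

```python
class A16:
    """16-letter alphabet; a predicate is a 16-bit mask, bit i set means letter i is included."""
    top = 0xFFFF
    bot = 0

    def neg(self, p):
        return ~p & 0xFFFF          # mask back to 16 bits (Python ints are unbounded)

    def disj(self, p, q):
        return p | q

    def conj(self, p, q):
        return p & q

    def is_sat(self, p):
        return p != 0

    def den(self, p):
        """Denotation: the letters (0..15) that satisfy predicate p."""
        return [i for i in range(16) if (p >> i) & 1]

    def one(self, i):
        """Predicate satisfied by letter i only."""
        return 1 << i

alg = A16()
some = alg.disj(alg.one(0), alg.one(4))
print(alg.den(some), alg.is_sat(alg.conj(some, alg.neg(some))))
```

Every Boolean operation is a single machine instruction on the mask, and satisfiability is a comparison with zero, which is what makes this algebra "effective".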
23
ALPHABET TRANSFORMATIONS
Effective Boolean algebras can be extended
– e.g. disjoint union
Effective Boolean algebras can be restricted
– e.g. restriction wrt. a given predicate
24
DISJOINT UNION OF ALPHABETS IN C#
    public class PairAlg<S, T> : IBoolAlg<Pair<S, T>>
    {
        IBoolAlg<S> A;
        IBoolAlg<T> B;
        public Pair<S, T> Bot { get { return new Pair<S, T>(A.Bot, B.Bot); } }
        …
        public Pair<S, T> Or(Pair<S, T> a, Pair<S, T> b)
        {
            return new Pair<S, T>(A.Or(a[0], b[0]), B.Or(a[1], b[1]));
        }
        // A pair is satisfiable if either component is.
        public bool IsSat(Pair<S, T> p)
        {
            return A.IsSat(p[0]) || B.IsSat(p[1]);
        }
    }
25
SFA VS. CLASSICAL AUTOMATA?
– SFAs can support infinite alphabets
– For some cases SFAs are exponentially more succinct than NFAs
Example (recall the BDDs β_i from before): [SFA diagram not preserved]
Equivalent NFA requires 2^k transitions.
26
SYMBOLIC FINITE AUTOMATA
Algorithms over SFAs.
27
ALGORITHMS OVER SFAS
Language intersection
– Uses product of automata
Language complementation
– Requires determinization
Minimization
– Extensions of Moore/Hopcroft [POPL’14]
Regex → SFA construction
– Uses BDDs to represent Unicode character sets
– Requires BDD ↔ interval-set conversions
– May cause exponential blowup: recall the BDDs β_i
28
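LANGUAGE INTERSECTION
Uses DFS and product of transitions.
[Diagram: a transition p1 → q1 in A and a transition p2 → q2 in B combine into a transition (p1,p2) → (q1,q2) in A×B whose guard is the conjunction of the two guards; the product transition is deleted when that guard is unsat]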
29
INTERSECTION EXAMPLE
Let φ_k(x) ≝ ((x mod k) = 0).
[Diagram: SFA A with states a1, a2 and SFA B with states b1, b2, with guards built from φ_2, φ_3 and φ_6; the product A×B has states a1b1, a1b2 and a2b2, and one product transition is crossed out (X)]
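The product construction with pruning of unsatisfiable guards can be sketched in Python as follows. The two automata below are a made-up example, not a reconstruction of the slide's diagram. Since every guard here depends only on x mod 6, a predicate is represented by its set of residues mod 6, which gives exact satisfiability checks without an SMT solver.

```python
M = 6
TOP = frozenset(range(M))

def phi(k):
    """Residues mod 6 satisfying (x mod k) = 0; exact when k divides 6."""
    return frozenset(r for r in range(M) if r % k == 0)

def neg(p):
    return TOP - p

def is_sat(p):
    return len(p) > 0

def product(trans_a, trans_b, init):
    """DFS over reachable state pairs; a product transition is kept only if its guard is sat."""
    stack, seen, out = [init], {init}, []
    while stack:
        pa, pb = stack.pop()
        for sa, ga, ta in trans_a:
            if sa != pa:
                continue
            for sb, gb, tb in trans_b:
                if sb != pb:
                    continue
                g = ga & gb                   # conjunction of the two guards
                if not is_sat(g):
                    continue                  # delete when unsat
                out.append(((pa, pb), g, (ta, tb)))
                if (ta, tb) not in seen:
                    seen.add((ta, tb))
                    stack.append((ta, tb))
    return out, seen

# Hypothetical automata for illustration.
A = [("a1", phi(2), "a2"), ("a2", phi(6), "a2")]
B = [("b1", neg(phi(3)), "b1"), ("b1", phi(3), "b2"), ("b2", TOP, "b2")]
trans, reachable = product(A, B, ("a1", "b1"))
print(len(trans), sorted(reachable))
```

The pair (a2, b1) has no self-loop in the result because φ_6 ∧ ¬φ_3 is unsatisfiable: every multiple of 6 is a multiple of 3.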
30
LANGUAGE COMPLEMENTATION
First determinize, then swap final and nonfinal states.
[Diagram: an SFA with states p, q, r is determinized into an SFA with states {p}, {q}, {q,r}, {r}; transitions with unsat guards are deleted]
31
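MINIMIZATION (SYMBOLIC MOORE)
D := (F × (Q\F)) ∪ ((Q\F) × F)
foreach (p’,q’) ∈ D, (p,q) ∉ D
  if (IsSat( guard(p,p’) ∧ guard(q,q’) )) add (p,q) to D
The loop is repeated until D no longer grows.
[Diagram: transitions p → p’ with guard φ and q → q’ with guard ψ; if p’ and q’ are distinguishable and IsSat(φ ∧ ψ), then p and q are distinguishable]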
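The pseudocode above can be sketched in Python, assuming a deterministic SFA whose transitions are normalized so that there is at most one guard per (source, target) pair. The function name and the example automaton over the alphabet 2^{a,b} are hypothetical.

```python
def distinguishable_pairs(states, finals, guard, conj, is_sat):
    """guard maps (source, target) to the predicate labeling that transition."""
    # Base case: a final and a nonfinal state are distinguishable.
    d = {(p, q) for p in states for q in states
         if (p in finals) != (q in finals)}
    changed = True
    while changed:                      # iterate to a fixpoint
        changed = False
        for p2, q2 in list(d):
            for p in states:
                for q in states:
                    if (p, q) in d:
                        continue
                    gp = guard.get((p, p2))
                    gq = guard.get((q, q2))
                    # Some common symbol leads p and q to distinguishable states.
                    if gp is not None and gq is not None and is_sat(conj(gp, gq)):
                        d.add((p, q))
                        changed = True
    return d

# Predicates are subsets of {a, b}; conjunction is intersection, sat is non-emptiness.
AB = frozenset("ab")
guard = {
    (0, 1): frozenset("a"), (0, 2): frozenset("b"),
    (1, 3): AB, (2, 3): AB, (3, 3): AB,
}
d = distinguishable_pairs([0, 1, 2, 3], {3}, guard, lambda x, y: x & y, bool)
print((1, 2) in d, (0, 1) in d)
```

States 1 and 2 never enter D, so they are equivalent and can be merged. The only alphabet-specific operations used are conjunction and satisfiability, so the same loop works for any effective Boolean algebra.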
32
REGEX → SFA
Classical algorithm extended to work with predicates
– First produces an εSFA (SFA with ε-moves)
– Then ε-moves are eliminated using the standard ε-elimination algorithm
– Requires an interval-set → BDD algorithm for converting character classes
Example: [\0x0-\0xFF] = BDD whose bits in pos. > 7 are 0
33
ONLINE SFA ALGORITHM EXAMPLES
http://www.rise4fun.com/Bex/zE
34
SYMBOLIC FINITE TRANSDUCERS
35
SYMBOLIC FINITE TRANSDUCER (SFT)
Labels are guarded transformation functions.
Symbolic transition between states p and q:
  λx. 0x80 ≤ x ≤ 0x7FF / [0xC0 | x[10:6], 0x80 | x[5:0]]
  (guard / output built from bit-vector operations; x[i:j] denotes bits i down to j of x)
Concrete transitions: ‘\x80’/“\xC2\x80”, …, ‘\x7FF’/“\xDF\xBF” (1920 transitions)
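The guarded output function on this slide is the two-byte case of UTF-8 encoding, which can be checked exhaustively against Python's own encoder. The function name `utf8_two_byte` is hypothetical; the bit extraction x[10:6] is written as a shift and x[5:0] as a mask.

```python
def utf8_two_byte(x):
    """Output of the symbolic transition for one input character x."""
    assert 0x80 <= x <= 0x7FF            # the guard
    return [0xC0 | (x >> 6),             # 0xC0 | x[10:6]
            0x80 | (x & 0x3F)]           # 0x80 | x[5:0]

domain = range(0x80, 0x7FF + 1)
print(len(domain), bytes(utf8_two_byte(0x80)), bytes(utf8_two_byte(0x7FF)))
```

One symbolic transition therefore stands for all 1920 concrete input/output pairs in the guard's range.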
36
SFT EXECUTION EXAMPLE
[Diagram: states p and q with transitions labeled x mod 2 = 0 / [x, x], x mod 2 = 1 / [x-1], x mod 2 = 0 / [] and x mod 2 = 1 / [x-1]]
Input tape: 1 2 5 3
State sequence: p p q p p
Output tape (as extracted): 02 42
37
SYMBOLIC FINITE TRANSDUCERS
Properties and algorithms
38
WHY SFTS?
They have good algebraic properties (POPL'12)
– SFTs are closed under composition
– Equivalence is decidable in the single-valued case
– The domain of an SFT is an SFA
SFAs are closed under Boolean operations
Useful for various analysis tasks
39
SFT COMPOSITION
A∘B = λx. B(A(x))
A: a1 → a2 on x>0 / [x+1, x+2]
B: b1 → b2 on x<5 / [], then b2 → b3 on x<4 / [x, x]
A∘B: (a1,b1) → (a2,b3) on x>0 ∧ x+1<5 ∧ x+2<4 / [x+2, x+2]
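The composed transition can be checked against running A and then B on concrete inputs. This Python sketch assumes integer characters, that each SFT consists of only the transitions shown, and that B accepts exactly after reaching b3; `None` stands for "input rejected".

```python
def sft_a(x):
    """A on a one-symbol input: a1 -> a2 on x > 0, output [x+1, x+2]."""
    return [x + 1, x + 2] if x > 0 else None

def sft_b(ys):
    """B on a two-symbol input: b1 -> b2 on y < 5 / [], then b2 -> b3 on y < 4 / [y, y]."""
    if ys is None or len(ys) != 2:
        return None
    y1, y2 = ys
    if y1 < 5 and y2 < 4:
        return [y2, y2]
    return None

def composed(x):
    """The single composed transition from the slide."""
    if x > 0 and x + 1 < 5 and x + 2 < 4:
        return [x + 2, x + 2]
    return None

for x in range(-3, 6):
    print(x, sft_b(sft_a(x)), composed(x))
```

The composed guard is obtained by substituting A's outputs x+1 and x+2 into B's guards, and the composed output by substituting x+2 into B's output. Over the integers only x = 1 satisfies the guard, giving output [3, 3].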
40
SFT ALGORITHMS
Composition: SFT A and SFT B are combined into SFT A∘B.
Equivalence checking for single-valued SFTs (undecidable in general): given SFT A and SFT B, a failed check yields an “input string” witnessing that A and B are not equivalent.
Algorithms use SMT for satisfiability checking of character formulas.
41
PROPERTY ANALYSIS (USENIX SEC'11)
Idempotence: does it matter if a sanitizer is applied twice?
– Compare A∘A with A; a failed check yields an “input string” showing that A is not idempotent.
Commutativity: does the order of sanitizers matter?
– Compare A∘B with B∘A; a failed check yields an “input string” showing that A and B are not commutative.
42
APPLICATIONS
43
APPLICATIONS OF SFAS/SFTS
SFAs:
– Regex support in parameterized unit testing
– Fuzz testing of regexes
– Password generation
SFTs:
– Analysis of string encoders/decoders
– Security analysis of sanitizers
44
APPLICATION 1: REGEXES IN PARAMETERIZED UNIT TESTING
Rex component in Pex
Generate values for s that reach the return branches
– s is a string of Unicode characters (16-bit bit-vectors)
    bool IsValidEmail(string s)
    {
        string r1 = @"^[A-Za-z0-9]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$";
        string r2 = @"^\d.*$";
        if (System.Text.RegularExpressions.Regex.IsMatch(s, r1))
            if (System.Text.RegularExpressions.Regex.IsMatch(s, r2))
                return false; // branch 1
            else
                return true;  // branch 2
        else
            return false;     // branch 3
    }
Branch 1: solve s ∈ L(r1) ∩ L(r2), e.g. s = “3@a.b”
Branch 2: solve s ∈ L(r1) \ L(r2), e.g. s = “a@b.c”
Branch 3: solve s ∉ L(r1), e.g. s = “a@..c”
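The three witness strings can be checked against the branch conditions with Python's `re` module. This sketch only validates the given solutions; it does not generate them, which is the part Rex does by solving the language constraints. The function name `branch` is hypothetical, and the two patterns are assumed to behave the same in Python as in .NET for these inputs.

```python
import re

R1 = r"^[A-Za-z0-9]+@(([A-Za-z0-9\-])+\.)+([A-Za-z\-])+$"
R2 = r"^\d.*$"

def branch(s):
    """Return which return branch of IsValidEmail the string s reaches."""
    if re.search(R1, s):
        if re.search(R2, s):
            return 1      # s in L(r1) and in L(r2)
        return 2          # s in L(r1) but not in L(r2)
    return 3              # s not in L(r1)

for s in ["3@a.b", "a@b.c", "a@..c"]:
    print(s, branch(s))
```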
45
APPLICATION 2: PASSWORD GENERATION
Given constraints:
– Length is k: "^[\x21-\x7E]{k}$"
– Contains 2 capital letters: "[A-Z].*[A-Z]"
– Contains a digit: "\d"
– Contains a non-word character: "\W"
Generate random instances with uniform distribution that match all the above conditions.
k=4: http://www.rise4fun.com/Rex/4nE
http://www.rise4fun.com/Bek/c3j
46
APPLICATION 3: SAFETY ANALYSIS
Example: suppose good output = “NoEars"
– NoEars = [^\uDE38-\uDE40]*
– Bad output: WithEars = Complement(NoEars)
Does there exist an input x that causes “ears" in the output?
– ∃x (smileycipher(x) ∈ WithEars)? That is, is the set {x | smileycipher(x) ∈ WithEars} nonempty?
http://www.rise4fun.com/Bek/5sHO
47
EXTENSIONS
48
EXTENSIONS OF SFAS AND SFTS
ESFT
– SFA/SFTs with look-ahead [CAV'13]
– BEX language
STT
– Symbolic automata/transducers over trees
– FAST language [PLDI’14]
k-SFT
– SFT with lookback [POPL’15]
49
ESFAS AND ESFTS
Unlike in the classical case, look-ahead breaks many properties
– e.g. equivalence of ESFAs is undecidable
Example ESFT transition (a loop on state q):
  x1 ≤ 0xFF ∧ x2 ≤ 0xFF ∧ x3 ≤ 0xFF / [x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F]
The ESFT above reads 3 and writes 4 symbols (base64 encoder), e.g. “Man” becomes “TWFu”.
http://www.rise4fun.com/Bex/tutorial/guide
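The output expression of this transition can be checked against Python's `base64` module. In this sketch the function names are hypothetical, and the final mapping from 6-bit digits to characters uses the standard base64 alphabet, which the transition itself leaves implicit.

```python
import base64
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def esft_step(x1, x2, x3):
    """One ESFT transition: read three bytes, write four 6-bit digits."""
    assert x1 <= 0xFF and x2 <= 0xFF and x3 <= 0xFF   # the guard
    return [x1 >> 2,
            ((x1 & 3) << 4) | (x2 >> 4),
            ((x2 & 0xF) << 2) | (x3 >> 6),
            x3 & 0x3F]

def encode3(chunk):
    """Encode exactly three bytes as four base64 characters."""
    return "".join(ALPHABET[d] for d in esft_step(*chunk))

print(esft_step(*b"Man"), encode3(b"Man"))
```

The look-ahead of 3 is what lets a single transition combine bits from neighboring input symbols; an ordinary SFT reading one symbol at a time would need extra states to remember the leftover bits.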
50
FAST (TREE TRANSDUCERS)
Trees are common input/output data structures
– XML query, type-checking, etc.
– Natural language translators (from parse tree to parse tree)
– Compilers/optimizers (from parse tree to parse tree)
– Tree manipulating programs: data structures algorithms, ontologies, etc.
– Augmented reality
http://www.rise4fun.com/Fast/tutorial/guide
51
OUR RECIPE FOR EACH TASK
[Diagram components: DSL, Transducer Model, Transformation, Analysis (Z3, Automata-.NET), Code Gen (C#, JavaScript, C). Analysis question: does it do the right thing?]
DSL example:
    s := iter(c in t)[b := false;] {
      case (!b && c in "[\"\\]"): b := false; yield('\\', c);
      case (c == '\\'): b := !b; yield(c);
      case (true): b := false; yield(c);
    };
52
Automata-.NET will be open source on GitHub under MIT license.
Some references:
BEK
– Fast and precise sanitizer analysis with BEK. Hooimeijer, Livshits, Molnar, Saxena, Veanes, USENIX11
– Symbolic finite state transducers: algorithms and applications. Veanes, Hooimeijer, Livshits, Molnar, Bjorner, POPL12
BEX
– Static analysis of string encoders and decoders. D’Antoni, Veanes, VMCAI13
– Equivalence of extended symbolic finite transducers. D’Antoni, Veanes, CAV13
– Data parallel string manipulating programs. Veanes, Mytkowicz, Molnar, Livshits, POPL15
53
QUESTIONS?
Links to related online tutorials:
– Bek: http://rise4fun.com/Bek/tutorial
– Bex: http://rise4fun.com/Bex/tutorial
– Rex: http://rise4fun.com/rex/
– Fast: http://rise4fun.com/Fast/tutorial