PROGRAMMING USING AUTOMATA AND TRANSDUCERS Loris D’Antoni, Margus Veanes
All features of a general purpose language vs. features needed: replace, match, char…
FOR EACH DOMAIN SPECIFIC TASK
Design a language that:
only has the features required by the task
is simple to use
enables automatic reasoning about what the programs do
compiles into efficient code
OUTLINE Automata, transducers, and programs BEK and string sanitizers BEX and string encoders FAST and tree manipulating programs What’s next? 8
AUTOMATA, TRANSDUCERS, AND PROGRAMS 9
Finite alphabet

type base = A | T | C | G

Languages of strings

let rec all_TG (l: base list) : bool =
  match l with
  | [] -> true
  | h :: t -> (h = T || h = G) && (all_TG t)

let rec all_AC (l: base list) : bool =
  match l with
  | [] -> true
  | h :: t -> (h = A || h = C) && (all_AC t)

Transformations from strings to strings

let rec map_base (l: base list) : base list =
  match l with
  | [] -> []
  | A :: t -> T :: (map_base t)
  | T :: t -> A :: (map_base t)
  | G :: t -> C :: (map_base t)
  | C :: t -> G :: (map_base t)

let rec filter_AC (l: base list) : base list =
  match l with
  | [] -> []
  | A :: t -> A :: (filter_AC t)
  | T :: t -> filter_AC t
  | G :: t -> filter_AC t
  | C :: t -> C :: (filter_AC t)

Corresponding automata and transducers (one state q0 each):
all_TG: q0 loops on T, G
all_AC: q0 loops on A, C
map_base: q0 loops on A/T, T/A, G/C, C/G
filter_AC: q0 loops on A/A, T/ε, G/ε, C/C
FINITE AUTOMATA
[Diagram: automaton with transitions labeled a and b]
abab: Yes
aba: No
bb: Yes
a: No
FINITE STATE TRANSDUCERS
[Diagram: transducer with transitions a/aa and b/bb and final output zz]
ab: aabbzz
b: bbzz
aba: UNDEFINED
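The slide's examples can be reproduced with a small simulator. This is a minimal sketch, not the diagram itself: the state names `q0` and `q1`, the choice of `q1` as the only final state, and the reading of the examples as ab ↦ aabbzz, b ↦ bbzz, aba ↦ undefined are assumptions recovered from the flattened slide text.

```python
# Hypothetical two-state transducer consistent with the slide's examples.
# Assumption: q0 is initial; q1 is the only final state and emits "zz" at the end.
TRANSITIONS = {
    ("q0", "a"): ("q0", "aa"),
    ("q0", "b"): ("q1", "bb"),
    ("q1", "a"): ("q0", "aa"),
    ("q1", "b"): ("q1", "bb"),
}
FINAL_OUTPUT = {"q1": "zz"}


def run_fst(word, start="q0"):
    """Return the output string, or None when the transducer is undefined on word."""
    state, out = start, []
    for ch in word:
        step = TRANSITIONS.get((state, ch))
        if step is None:
            return None  # no transition for this symbol
        state, piece = step
        out.append(piece)
    if state not in FINAL_OUTPUT:
        return None  # run ended in a non-final state
    return "".join(out) + FINAL_OUTPUT[state]
```

Under these assumptions, `run_fst("aba")` returns `None` because the run ends in the non-final state.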
BENEFITS OF AUTOMATA AND TRANSDUCERS Closure and decidability for automata: Intersection, union, complement Decidable emptiness Decidable equivalence Can be minimized 14
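Intersection and emptiness can be sketched with the textbook product construction and a breadth-first search. The dictionary encoding of automata and the helper names (`make_dfa`, `intersect`, `shortest_accepted`, `HAS_T`) are illustrative assumptions, not part of any library mentioned in the talk.

```python
from collections import deque


def make_dfa(start, finals, delta):
    """A DFA as a dict; delta maps (state, symbol) to the next state (may be partial)."""
    return {"start": start, "finals": set(finals), "delta": dict(delta)}


def intersect(d1, d2):
    """Product construction: a pair of states is final when both components are."""
    start = (d1["start"], d2["start"])
    delta, finals, seen, todo = {}, set(), {start}, deque([start])
    while todo:
        p, q = todo.popleft()
        if p in d1["finals"] and q in d2["finals"]:
            finals.add((p, q))
        for (src, sym), dst1 in d1["delta"].items():
            if src != p or (q, sym) not in d2["delta"]:
                continue
            nxt = (dst1, d2["delta"][(q, sym)])
            delta[((p, q), sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return make_dfa(start, finals, delta)


def shortest_accepted(dfa):
    """Breadth-first search for a witness; None means the language is empty."""
    todo, seen = deque([(dfa["start"], "")]), {dfa["start"]}
    while todo:
        state, word = todo.popleft()
        if state in dfa["finals"]:
            return word
        for (src, sym), dst in dfa["delta"].items():
            if src == state and dst not in seen:
                seen.add(dst)
                todo.append((dst, word + sym))
    return None


ALL_TG = make_dfa("q0", {"q0"}, {("q0", "T"): "q0", ("q0", "G"): "q0"})
ALL_AC = make_dfa("q0", {"q0"}, {("q0", "A"): "q0", ("q0", "C"): "q0"})

# Hypothetical extra language: strings over A, T, C, G containing at least one T.
has_t_delta = {("n", s): ("y" if s == "T" else "n") for s in "ATCG"}
has_t_delta.update({("y", s): "y" for s in "ATCG"})
HAS_T = make_dfa("n", {"y"}, has_t_delta)
```

With these definitions, `all_AC ∩ HAS_T` is empty, while `all_TG ∩ all_AC` contains only the empty string.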
BENEFITS OF AUTOMATA AND TRANSDUCERS
Transducer composition

let m_f_DNA l : base list = filter_AC (map_base l)

map_base: q0 loops on A/T, T/A, G/C, C/G
filter_AC: q0 loops on A/A, T/ε, G/ε, C/C
m_f_DNA: q0 loops on A/ε, T/A, G/C, C/ε
BENEFITS OF AUTOMATA AND TRANSDUCERS
Type-checking: input in all_TG, then map_base, then output in all_AC
map_base o (¬ all_AC): map_base only defined if the output is in (¬ all_AC)
dom(map_base o (¬ all_AC)): inputs for which map_base does not output in all_AC
Check: dom(map_base o (¬ all_AC)) ∩ all_TG = ∅
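The type-check above can be carried out mechanically. The sketch below assumes map_base is encoded as a letter-to-letter map and automata as dictionaries; `preimage` and `intersection_is_empty` are illustrative helper names, not an API from the talk.

```python
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}  # letter-to-letter transducer

# DFA for the complement of all_AC: strings that contain a T or a G.
NOT_ALL_AC = {
    "start": "ok",
    "finals": {"bad"},
    "delta": {
        ("ok", "A"): "ok", ("ok", "C"): "ok",
        ("ok", "T"): "bad", ("ok", "G"): "bad",
        ("bad", "A"): "bad", ("bad", "C"): "bad",
        ("bad", "T"): "bad", ("bad", "G"): "bad",
    },
}
ALL_TG = {"start": "q0", "finals": {"q0"},
          "delta": {("q0", "T"): "q0", ("q0", "G"): "q0"}}
ALL_AC = {"start": "q0", "finals": {"q0"},
          "delta": {("q0", "A"): "q0", ("q0", "C"): "q0"}}


def preimage(letter_map, dfa):
    """DFA accepting the words whose image under letter_map is accepted by dfa."""
    delta = {}
    for x, y in letter_map.items():
        for (src, sym), dst in dfa["delta"].items():
            if sym == y:
                delta[(src, x)] = dst  # reading x behaves like reading letter_map[x]
    return {"start": dfa["start"], "finals": dfa["finals"], "delta": delta}


def intersection_is_empty(d1, d2):
    """Explore the product automaton and look for a jointly accepting state."""
    start = (d1["start"], d2["start"])
    todo, seen = [start], {start}
    while todo:
        p, q = todo.pop()
        if p in d1["finals"] and q in d2["finals"]:
            return False
        for (src, sym), dst in d1["delta"].items():
            if src == p and (q, sym) in d2["delta"]:
                nxt = (dst, d2["delta"][(q, sym)])
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return True


# dom(map_base o (¬ all_AC)): inputs whose image falls outside all_AC.
bad_inputs = preimage(MAP_BASE, NOT_ALL_AC)
type_checks = intersection_is_empty(bad_inputs, ALL_TG)
```

Here `type_checks` is `True`: no input in all_TG is mapped outside all_AC. Replacing `ALL_TG` with `ALL_AC` makes the check fail, since map_base sends A to T.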
BENEFITS OF AUTOMATA AND TRANSDUCERS Transducer equivalence let m_f_DNA l : base list = filter_AC (map_base l) let f_m_DNA l : base list = map_base (filter_AC l) Is m_f_DNA equivalent to f_m_DNA ? 19
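One way to explore the question is to build both compositions and run them on a sample input. The sketch assumes deterministic transducers encoded as dictionaries in which every state is accepting; `compose` and `run` are illustrative names. A single differing input refutes equivalence; it does not replace a decision procedure.

```python
def compose(first, second):
    """Transducer for second(first(w)). Both map (state, symbol) to
    (next_state, output_string) and start in state "q0"."""
    start = ("q0", "q0")
    result, todo, seen = {}, [start], {start}
    while todo:
        p, q = todo.pop()
        for (src, sym), (p_next, middle) in first.items():
            if src != p:
                continue
            q_next, out = q, ""
            for m in middle:  # feed the intermediate string to `second`
                if (q_next, m) not in second:
                    break
                q_next, piece = second[(q_next, m)]
                out += piece
            else:  # `second` accepted the whole intermediate string
                target = (p_next, q_next)
                result[((p, q), sym)] = (target, out)
                if target not in seen:
                    seen.add(target)
                    todo.append(target)
    return result


def run(transducer, word, start=("q0", "q0")):
    """Run a composed transducer on word and return its output."""
    state, out = start, ""
    for ch in word:
        state, piece = transducer[(state, ch)]
        out += piece
    return out


MAP_BASE = {("q0", "A"): ("q0", "T"), ("q0", "T"): ("q0", "A"),
            ("q0", "G"): ("q0", "C"), ("q0", "C"): ("q0", "G")}
FILTER_AC = {("q0", "A"): ("q0", "A"), ("q0", "T"): ("q0", ""),
             ("q0", "G"): ("q0", ""), ("q0", "C"): ("q0", "C")}

m_f_DNA = compose(MAP_BASE, FILTER_AC)  # filter_AC (map_base l)
f_m_DNA = compose(FILTER_AC, MAP_BASE)  # map_base (filter_AC l)
```

On the input ATGC the two compositions disagree (`"AC"` versus `"TG"`), so m_f_DNA and f_m_DNA are not equivalent.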
BEK analysis of string sanitizers
P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, M. Veanes [USENIX11, POPL12]
QUESTION: What could possibly go wrong?
27 Attacker: gollum.png' onload='javascript:... Result: <img src='gollum.png' onload='javascript:…
28 Attacker: im.png' onload='javascript:... Result: <img src='im.png' onload='javascri I found my PRECIOUSS S.
FIRST LINE OF DEFENSE: SANITIZERS
Sanitizer: a string transformation function.
Untrusted data goes in, sanitized data comes out. Strings shown: “im.png' …”, “img.png' …”
PLDI'12 submission presentations, Dec 8, 2011
COMPARING SANITIZERS 31
Single quote: '
HTML entity: &#39;
Input: some untrusted input
Library A: Name: HtmlEncode. Around for: years. Availability: readily available to C# developers.
Library B: Name: HtmlEncode. Around for: years. Availability: readily available to C# developers.
Result on a single quote ': ✔ for one library, ✘ for the other.
.NET WebUtility

public static string HtmlEncode(string s)
{
    if (s == null) return null;
    int num = IndexOfHtmlEncodingChars(s, 0);
    if (num == -1) return s;
    StringBuilder builder = new StringBuilder(s.Length + 5);
    int length = s.Length;
    int startIndex = 0;
Label_002A:
    if (num > startIndex) {
        builder.Append(s, startIndex, num - startIndex);
    }
    char ch = s[num];
    if (ch > '>') {
        builder.Append("&#");
        builder.Append(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
        builder.Append(';');
    } else {
        char ch2 = ch;
        if (ch2 != '"') {
            switch (ch2) {
                case '<': builder.Append("&lt;"); goto Label_00D5;
                case '=': goto Label_00D5;
                case '>': builder.Append("&gt;"); goto Label_00D5;
                case '&': builder.Append("&amp;"); goto Label_00D5;
            }
        } else {
            builder.Append("&quot;");
        }
    }
Label_00D5:
    startIndex = num + 1;
    if (startIndex < length) {
        num = IndexOfHtmlEncodingChars(s, startIndex);
        if (num != -1) {
            goto Label_002A;
        }
        builder.Append(s, startIndex, length - startIndex);
    }
    return builder.ToString();
}

MS AntiXSS

private static string HtmlEncode(string input, bool useNamedEntities, MethodSpecificEncoder encoderTweak)
{
    if (string.IsNullOrEmpty(input)) { return input; }
    if (characterValues == null) { InitialiseSafeList(); }
    if (useNamedEntities && namedEntities == null) { InitialiseNamedEntityList(); }

    // Set up a new character array for output.
    char[] inputAsArray = input.ToCharArray();
    int outputLength = 0;
    int inputLength = inputAsArray.Length;
    char[] encodedInput = new char[inputLength * 10];
    SyncLock.EnterReadLock();
    try {
        for (int i = 0; i < inputLength; i++) {
            char currentCharacter = inputAsArray[i];
            int currentCodePoint = inputAsArray[i];
            char[] tweekedValue;

            // Check for invalid values
            if (currentCodePoint == 0xFFFE || currentCodePoint == 0xFFFF) {
                throw new InvalidUnicodeValueException(currentCodePoint);
            } else if (char.IsHighSurrogate(currentCharacter)) {
                if (i + 1 == inputLength) {
                    throw new InvalidSurrogatePairException(currentCharacter, '\0');
                }
                // Now peek ahead and check if the following character is a low surrogate.
                char nextCharacter = inputAsArray[i + 1];
                char nextCodePoint = inputAsArray[i + 1];
                if (!char.IsLowSurrogate(nextCharacter)) {
                    throw new InvalidSurrogatePairException(currentCharacter, nextCharacter);
                }
                // Look-ahead was good, so skip.
                i++;
                // Calculate the combined code point
                long combinedCodePoint = 0x10000 + ((currentCodePoint - 0xD800) * 0x400) + (nextCodePoint - 0xDC00);
                char[] encodedCharacter = SafeList.HashThenValueGenerator(combinedCodePoint);
                encodedInput[outputLength++] = '&';
                for (int j = 0; j < encodedCharacter.Length; j++) {
                    encodedInput[outputLength++] = encodedCharacter[j];
                }
                encodedInput[outputLength++] = ';';
            } else if (char.IsLowSurrogate(currentCharacter)) {
                throw new InvalidSurrogatePairException('\0', currentCharacter);
            } else if (encoderTweak != null && encoderTweak(currentCharacter, out tweekedValue)) {
                for (int j = 0; j < tweekedValue.Length; j++) {
                    encodedInput[outputLength++] = tweekedValue[j];
                }
            } else if (useNamedEntities && namedEntities[currentCodePoint] != null) {
                char[] encodedCharacter = namedEntities[currentCodePoint];
                encodedInput[outputLength++] = '&';
                for (int j = 0; j < encodedCharacter.Length; j++) {
                    encodedInput[outputLength++] = encodedCharacter[j];
                }
                encodedInput[outputLength++] = ';';
            } else if (characterValues[currentCodePoint] != null) {
                // character needs to be encoded
                char[] encodedCharacter = characterValues[currentCodePoint];
                encodedInput[outputLength++] = '&';
                for (int j = 0; j < encodedCharacter.Length; j++) {
                    encodedInput[outputLength++] = encodedCharacter[j];
                }
                encodedInput[outputLength++] = ';';
            } else {
                // character does not need encoding
                encodedInput[outputLength++] = currentCharacter;
            }
        }
    } finally {
        SyncLock.ExitReadLock();
    }
    return new string(encodedInput, 0, outputLength);
}
.NET WebUtility vs. MS AntiXSS
Same behavior on all inputs?
If not, what is a differentiating input?
Can it generate any known ‘bad’ outputs?
PHP Trunk: changes to html.c, 1999-2011
R7,841 April loc
R32,564 September 2000: ENT_QUOTES introduced
R242,949 September 2007: $double_encode=true
R309,482 March loc
43 PHP Trunk Changes to html.c, 1999—2011 Safe to apply twice? Safe to combine with other sanitizers?
MOTIVATION 44 Writing string sanitizers correctly is difficult There is no cheap way to identify problems with sanitizers ‘Correctness’ is a moving target What if we could say more about sanitizer behavior?
CONTRIBUTIONS
BEK
Frontend: a small language for string manipulation; similar to how sanitizers are written today
Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
Evaluation
Converted sanitizers from a variety of sources
Checked properties like reversibility, idempotence, equivalence, and commutativity
BEK ARCHITECTURE
Bek Program:
s := iter(c in t)[b := false;] {
  case (!b && c in "[\"\\]"): b := false; yield('\\', c);
  case (c == '\\'): b := !b; yield(c);
  case (true): b := false; yield(c);
};
Transformation: Bek Program to Symbolic Finite Transducers
Analysis (Z3, Microsoft.Automata): Does it do the right thing? Counterexample: “\' vs. \\'”
Code Gen: C#, JavaScript, C
A BEK PROGRAM: ESCAPE QUOTES
escape := iter(c in s)[b := false;] {
  case (!b && c in "['\"]"): b := false; yield('\\', c);
  case (c == '\\'): b := !b; yield(c);
  case (true): b := false; yield(c);
};
iterate over the characters in string s while updating one boolean variable b
Simple dedicated syntax
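For readers unfamiliar with Bek syntax, the same loop can be written in Python. This is a hand translation under the assumption that `"['\"]"` denotes the character class containing the single and double quote; it is not output of the Bek code generator.

```python
def escape(s):
    """Backslash-escape quotes that are not already escaped."""
    out = []
    b = False  # True when the previous character was an unescaped backslash
    for c in s:
        if not b and c in "'\"":
            b = False
            out.append("\\" + c)  # yield('\\', c)
        elif c == "\\":
            b = not b
            out.append(c)  # yield(c)
        else:
            b = False
            out.append(c)  # yield(c)
    return "".join(out)
```

A quote preceded by an odd number of backslashes is left alone, which is why applying `escape` twice to `it's` gives the same result as applying it once.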
FINITE STATE TRANSDUCERS
Transitions: a/A, b/B, …, z/Z, …, &/&
Problem: the alphabet has 2^16 characters
TOO MANY TRANSITIONS
SYMBOLIC FINITE TRANSDUCERS 57 Only two transitions!! x in [a-z] / x-32 x not in [a-z] / x
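The two symbolic transitions can be mimicked with guard and output functions over code points. This is a minimal sketch using Python lambdas in place of Z3 predicates; `TO_UPPER` and `run_sft` are illustrative names.

```python
# Each transition is (guard predicate, output function) over integer code points.
TO_UPPER = [
    (lambda x: ord("a") <= x <= ord("z"), lambda x: x - 32),    # x in [a-z] / x-32
    (lambda x: not (ord("a") <= x <= ord("z")), lambda x: x),   # x not in [a-z] / x
]


def run_sft(transitions, text):
    """Apply a single-state symbolic transducer character by character."""
    out = []
    for ch in text:
        x = ord(ch)
        for guard, fn in transitions:
            if guard(x):
                out.append(chr(fn(x)))
                break
    return "".join(out)
```

The two guards cover every character, so the transducer needs only two transitions regardless of alphabet size.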
SYMBOLIC FINITE TRANSDUCERS
Transitions are labeled with predicates / sequences of functions:
x>5 / [x+1, x]
x%2=1 / [x-1, x, x+4]
true / [5]
true / [x-4]
Alphabet theory has to be DECIDABLE
We’ll use Z3 to check predicate satisfiability
BEK ARCHITECTURE
Bek Program, Transformation, Symbolic Finite Transducers, Analysis, Code Gen (C#, JavaScript, C)
Now what?
SFT Algorithms 61 EQUIVALENCE CHECKING IS DECIDABLE! Alphabet theory has to be DECIDABLE We’ll use Z3 to check predicate satisfiability
SFT Algorithms 62 AntiXSS.HtmlEncode = WebUtility.HtmlEncode EQUIVALENCE CHECKING
CLOSED UNDER COMPOSITION
SFT Algorithms: COMPOSITION
Running SFT A and then SFT B on an input is the same as running a single SFT that composes A and B.
JavaScriptEncode(HtmlEncode(w)) = HtmlEncode(JavaScriptEncode(w))
PRE-IMAGE COMPUTATION
Given SFT A and a regular language O of outputs, compute the regular language I of inputs that A maps into O.
When O describes MALICIOUS outputs, I is the vulnerability signature.
QUESTIONS
Can BEK model existing sanitizers?
Can we use it to check interesting properties on real sanitizers?
Language Features 69 Data: 1x OWASP HTMLencode 13x Google AutoEscape 21x IE 8 XSS Filter 7x Synthetic inspect feature counts WHAT FEATURES ARE NEEDED?
Language Features 70 Majority (76%) of sanitizers can be ported without extending the language With multi-character lookahead: 90% WHAT FEATURES ARE NEEDED?
CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Data: 4x MS internal HtmlEncode, 3x ‘for hire’ HtmlEncode based on an English-language specification (C#)
Commutative? Equivalent?
CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Short answer: Yes!
EQ results take less than a minute to obtain:

     1  2  3  4  5  6  7
1    ✔  ✔  ✔  ✘  ✘  ✔  ✘
2       ✔  ✔  ✘  ✘  ✔  ✘
3          ✔  ✘  ✘  ✔  ✘
4             ✔  ✘  ✘  ✘
5                ✔  ✘  ✘
6                   ✔  ✘
7                      ✔
DOES IT SCALE?
[Plots: Self-Equivalence, Commutativity]
The Cheat Sheet 75 One out of seven implementations correctly encodes all strings for use in both HTML and attribute contexts WERE ALL SANITIZERS BROKEN?
BEK IN A NUTSHELL
BEK is a domain-specific language for writing string sanitizers
BEK can model programs without approximation using symbolic finite transducers, enabling e.g., equivalence checks
BEK was evaluated using real-world sanitizers from a variety of different sources
BEX ANALYSIS OF STRING ENCODERS
Loris D’Antoni, Margus Veanes [VMCAI13, CAV13]
Plain text: Hi I'm plain text, nice to meet you!
Encoder output / Decoder input: SGkgSSdtIHBsYWluIHRleHQsIG5pY2UgdG8gbWVldCB5b3Uh
NOT SO EASY TO GET RIGHT 80
WHEN ARE THEY CORRECT?
T goes through the Encoder to T’, and T’ goes through the Decoder back to T
CAN WE USE TRANSDUCERS?
Encoder o Decoder = Identity
Language Features 83 Majority (76%) of sanitizers can be ported without extending Bek With multi-character lookahead: 90% BEK: WHAT FEATURES WERE NEEDED?
BASE64 encoder: 3 bytes to 4 Base64 characters

Text content     M         a         n
Bytes            77        97        110
Bit pattern      01001101  01100001  01101110
Index            19     22     5      46
Base64 encoded   T      W      F      u
85 HOW DO WE EXTEND BEK?
BEK ARCHITECTURE
Bek Program, Transformation, Symbolic Finite Transducers, Analysis, Code Gen (C#, JavaScript, C)
Symbolic finite transducers don’t have registers
TRANSDUCERS WITH REGISTERS
Transitions:
x / [ x>>2 ], r := (x&3)<<4
x / [ r|(x>>4) ], r := (x&0xF)<<2
x / [ r|(x>>6), x&0x3F ], r := 0
Transducers with registers are closed under composition
Equivalent to Turing Machines
EXPLORE REGISTERS VALUES
Register has finitely many values: remember the last value
2^|bits| states
89 BASE64 IN BEX DEMO
BEX ARCHITECTURE
Bex Program, Transformation, ? (transducer model still to be chosen), Analysis (Z3, Microsoft.Automata): Does it do the right thing?, Code Gen (C#, JavaScript, C)
EXTENDED SYMBOLIC FINITE TRANSDUCERS
Input: Man…
Output: TWFu…
Transition from p to q reading 3 symbols x1, x2, x3:
x1 ≤ FF ∧ x2 ≤ FF ∧ x3 ≤ FF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
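The lookahead-3 transition can be checked against Python's standard `base64` module. The sketch assumes the output functions are the standard Base64 bit slicing, `((x1&3)<<4)|(x2>>4)` and `((x2&0xF)<<2)|(x3>>6)` for the middle two, and handles only inputs whose length is a multiple of 3 (no padding); `esft_step` and `encode` are illustrative names.

```python
import base64

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def esft_step(x1, x2, x3):
    """One ESFT transition: read three bytes, emit four 6-bit indices."""
    assert x1 <= 0xFF and x2 <= 0xFF and x3 <= 0xFF  # the transition guard
    return [
        x1 >> 2,
        ((x1 & 3) << 4) | (x2 >> 4),
        ((x2 & 0xF) << 2) | (x3 >> 6),
        x3 & 0x3F,
    ]


def encode(data):
    """Base64-encode bytes whose length is a multiple of 3 (padding not handled)."""
    if len(data) % 3 != 0:
        raise ValueError("this sketch does not implement padding")
    out = []
    for i in range(0, len(data), 3):
        out.extend(B64[v] for v in esft_step(*data[i:i + 3]))
    return "".join(out)
```

For the slide's example, `encode(b"Man")` yields `TWFu`, via the indices 19, 22, 5, 46.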
MORE EXPRESSIVE THAN SYMBOLIC FINITE TRANSDUCERS
x1 > x2 / [x1 + x2]
Do they still have nice properties?
WHAT DO WE NEED?
Encoder o Decoder = Identity
Needed operations: Composition, Equivalence
NEGATIVE RESULTS 97 ESFAs: – equivalence is undecidable – are not closed under intersection – are not closed under complement ESFTs – equivalence is undecidable – are not closed under composition
A FRIENDLIER RESTRICTION 98
CARTESIAN EXTENDED SYMBOLIC FINITE TRANSDUCERS
Negative results use binary predicates, and encoders do not use this feature
Only allow conjunctions of unary predicates
Not Cartesian (binary predicate): x1 < x2+1
Cartesian: x1 > 5 ∧ x2 = 1 / [x1 + x2, x1]
CARTESIAN ESFA = SFA
Cartesian ESFAs are now equivalent to SFAs
ESFA: state 0 to state 1 on x1 > 5 ∧ x2 = 1
SFA: state 0 to intermediate state (0,1) on x > 5, then to state 1 on x = 1
STILL MORE EXPRESSIVE THAN SFTS
Cartesian ESFTs are strictly more expressive than SFTs!!
ESFT: state 0 to state 1 on x1 > 5 ∧ x2 = 1 / [x1 + x2]
SFT: ?
RESULTS 103 Cartesian ESFTs – equivalence is decidable – are not closed under composition
COMPOSITION IN PRACTICE 104
105 BEK WITH REGISTERS?
COMPOSING CARTESIAN ESFTS
Cartesian ESFTs A and B are translated to transducers with registers A’ and B’
A’ and B’ are composed into A’ o B’
Can A’ o B’ be turned back into a Cartesian ESFT A o B?
REGISTER ELIMINATION
Transducer with register r:
x / [ x+4 ], r := (x-2)
x / [ r+x, x+1 ], r := 0
ESFT:
[x1, x2] / [ x1+4, x1-2+x2, x2+1 ], r := 0
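The effect of register elimination can be checked by running both forms side by side. The sketch assumes the two register transitions alternate, starting with the one that sets `r := x-2`, and that inputs have even length; both function names are illustrative.

```python
def run_with_register(xs):
    """Two-transition register transducer; transitions alternate per input symbol."""
    out, r = [], 0
    for i, x in enumerate(xs):
        if i % 2 == 0:
            out.append(x + 4)            # x / [x+4], r := x-2
            r = x - 2
        else:
            out.extend([r + x, x + 1])   # x / [r+x, x+1], r := 0
            r = 0
    return out


def run_esft(xs):
    """Register-free ESFT: read two symbols at a time."""
    out = []
    for i in range(0, len(xs), 2):
        x1, x2 = xs[i], xs[i + 1]
        out.extend([x1 + 4, x1 - 2 + x2, x2 + 1])  # [x1,x2] / [x1+4, x1-2+x2, x2+1]
    return out
```

Substituting `r = x1 - 2` into `r + x2` gives the middle output `x1 - 2 + x2`, so the register is no longer needed.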
DOES IT WORK? 109
UNICODE
UTF8 to UTF16 encoder (E) and decoder (D)

Test                  Running Time
Dom(E) = UTF16        47 ms
Dom(EoD) = UTF16      109 ms
Dom(D) = UTF8         156 ms
Dom(DoE) = UTF8       320 ms
EoD = Identity        16 ms
DoE = Identity        24 ms

Complete analysis in about a second
BASE64
Base64 encoder (E) and decoder (D)

Test                  Running Time
Dom(E) = bytes        13 ms
Dom(EoD) = bytes      55 ms
Dom(D) = 6bits+       76 ms
Dom(DoE) = 6bits+     56 ms
EoD = Identity        53 ms
DoE = Identity        19 ms
BEX ARCHITECTURE
Bex Program, Transformation, Cartesian Extended Symbolic Finite Transducers, Analysis (Z3, Microsoft.Automata): Does it do the right thing? EoD=I, Code Gen (C#, JavaScript, C)
BEX IN A NUTSHELL
BEX is a domain-specific language for writing string encoders
BEX can model programs without approximation using Cartesian extended symbolic finite transducers
BEX was evaluated using real-world string encoders
FAST ANALYSIS OF PROGRAMS MANIPULATING TREES
Loris D’Antoni, Margus Veanes, Ben Livshits, David Molnar [PLDI14]
SOLUTION: USE AN HTML SANITIZER
Remove malicious active code from HTML documents
Before: alert(“This is Sparta!”); I swear this HTML is safe!
After SANITIZE: I swear this HTML is safe!
TYPICAL TRANSFORMATIONS
Typical transformations:
Remove scripts
Remove malicious URLs
Replace deprecated tags
Interesting questions, given a sanitizer S:
Does S always produce a safe and well-formed output?
Is S defined on every possible HTML file?
Does executing S twice produce the same output as executing S once?
Can we execute S fast?
HOW DO WE WRITE ONE?
DEMO
KEY IDEA: HTML CODE IS A TREE
Before: body( script(malicious code), div( p(I swear this HTML is safe!) ) )
After SANITIZE: body( div( p(I swear this HTML is safe!) ) )
MOTIVATION Trees are common input/output data structures – XML query, type-checking, etc… – Compilers/optimizers (from parse tree to parse tree) – Tree manipulating programs: data structures algorithms, ontologies, etc… 126
FAST ARCHITECTURE
Fast Program, Transformation, ? (transducer model still to be chosen), Analysis (Z3, Microsoft.Automata): Does it do the right thing?, Code Gen (C#, JavaScript, C)
CHOOSING THE RIGHT FORMALISM 128
SEMANTICS AS TRANSDUCERS Goal: find a class of tree transducers that can express the previous examples and is closed under composition 129
TOP DOWN TREE TRANSDUCERS [ENGELFRIET75]
Rule: q(a(x1, x2)) → b(c, q1(x1))
Decidable properties: type-checking, etc…
Domain expressiveness: only finite alphabets
SYMBOLIC TREE TRANSDUCERS [PSI11]
Rule: q(λa.a>3, (x1, x2)) → λa.a+1, (λa.a-2, q1(x1))
The rule applies to a node labeled 5 because 5>3 is true
Decidable properties: type-checking, etc…
Domain expressiveness: infinite alphabets using predicates and functions
Structural expressiveness: can’t delete a node without reading it first
Alphabet theory has to be DECIDABLE
We’ll use Z3 to check predicate satisfiability
IMPROVING STRUCTURAL EXPRESSIVENESS
Transformation: delete the left child if it contains a script
If we delete the node we can’t check that the left child contained a script
Solution: Regular Look-Ahead (RLA)
REGULAR LOOK AHEAD
Rules can ask whether the children are in particular languages
– p1: the language of trees that contain a script node
– p2: the language of all trees
The transformation is now safe
Decidable properties: type-checking, etc…
Domain expressiveness: infinite alphabets
Structural expressiveness: good enough to express our examples
Model                                                               Decidability  Complexity  Structural expressiveness  Infinite alphabets
Top Down Tree Transducers [Engelfriet75]                            V             V           X                          X
Top Down Tree Transducers with Regular Look-ahead [Engelfriet76]    V             V           ~                          X
Streaming Tree Transducers [AlurDantoni12]                          V             X           V                          X
Data Automata [Bojanczyk98]                                         ~             X           X                          V
Symbolic Tree Transducers [VeanesBjoerner11]                        V             V           X                          V
Symbolic Tree Transducers RLA                                       V             V           ~                          V
COMPOSITION OF STTR
Given T1 and T2, build T1 o T2
This is not always possible!!
Find the biggest class for which it is possible
WHEN CAN WE COMPOSE?
Theorem: T(x) = T2(T1(x)) is definable by a Symbolic Tree Transducer with RLA if
– T1 is deterministic
All our examples fall in this category
Alphabet theory has to be DECIDABLE
We’ll use Z3 to check predicate satisfiability
FAST ARCHITECTURE
Fast Program, Transformation, Symbolic Tree Transducers with RLA, Analysis (Z3, Microsoft.Automata): Does it do the right thing?, Code Gen (C#, JavaScript, C)
CASE STUDIES AND EXPERIMENTS 138
CASE STUDIES AND EXPERIMENTS Program Optimization: Deforestation of functional programs Verification: HTML sanitization Analysis of functional programs Augmented reality app store 139 Infinite Alphabets: Integer Data types
DEFORESTATION
Removing intermediate data structures from programs

alphabet IList [i : int] { nil(0), cons(1) }
trans mapC: IList -> IList {
    nil() to nil [0]
  | cons(x) to cons [(i+5)%26] (mapC x)
}
def mapC2 : IList -> IList := compose mapC mapC

ADVANTAGE: the program is a single transducer that reads the input list only once, thanks to transducer composition
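The effect of composing mapC with itself can be imitated on plain Python lists. This is only an analogy for deforestation, assuming integer lists stand in for IList; it is not how FAST compiles transducers.

```python
def map_c(xs):
    """One pass: shift every element by 5 modulo 26."""
    return [(i + 5) % 26 for i in xs]


def map_c2_unfused(xs):
    """Two passes: builds an intermediate list between the two maps."""
    return map_c(map_c(xs))


def map_c2_fused(xs):
    """Single pass: the two label functions are composed per element."""
    return [((i + 5) % 26 + 5) % 26 for i in xs]
```

Both versions compute the same list; the fused one traverses the input once and allocates no intermediate list.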
STAGES BY EXAMPLE 141 mapCmapC2 Transducers
DEFORESTATION: SPEEDUP 142 f(f(f(…f(x)...) (f;f;f;…;f)(x)
ANALYSIS OF FUNCTIONAL PROGRAMS 143
AR INTERFERENCE ANALYSIS Recognizers output data that can be seen as a tree structure Spine Hip Neck HeadKnee Ankle Foot …. 144
APPS AS TREE TRANSFORMATIONS Applications that use recognizers can be modeled as FAST programs 145 trans addHat: STree -> STree Spine(x,y) to Spine(addHat(x), y) | Neck(h,l,r) to Neck(addHat(h), l, r) | Head(a) to Head(Hat(a))
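The addHat rules read naturally as a recursive function over labeled trees. The sketch assumes a tree is a `(label, children)` pair and that the transformation is undefined on any label without a rule; the `Face`, `Hip`, `L` and `R` labels in the usage note are hypothetical.

```python
def add_hat(tree):
    """Follow the spine and neck down to the head and wrap its child in a Hat node."""
    label, kids = tree
    if label == "Spine":
        x, y = kids
        return ("Spine", [add_hat(x), y])
    if label == "Neck":
        h, l, r = kids
        return ("Neck", [add_hat(h), l, r])
    if label == "Head":
        (a,) = kids
        return ("Head", [("Hat", [a])])
    raise ValueError("addHat has no rule for label " + label)
```

For example, applying `add_hat` to `("Spine", [("Neck", [("Head", [("Face", [])]), ("L", []), ("R", [])]), ("Hip", [])])` leaves the tree unchanged except that the `Head` node now has the single child `("Hat", [("Face", [])])`.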
COMPOSITION OF PROGRAMS
Two FAST programs p1 and p2 can be composed into a single FAST program p1;p2
ANOTHER RECOGNIZER 147 Room Floor Wall Table Chair …. Chair ….
INTERFERENCE ANALYSIS
Apps can be malicious: try to overwrite outputs of other apps
Apps interfere when they annotate the same node of a recognizer’s output
We can compose them and check if they interfere statically!!
– Put checker in the AppStore and analyze Apps before approval
Interfering apps:
Add cat ears / Add hat
Add pin to a city / Blur a city
Amazon Buy Now button / Malicious Buy Now button
INTERFERENCE ANALYSIS IN PRACTICE
100 generated FAST programs, up to 85 functions each
Check statically if they conflict pairwise for ANY possible input
Checked 99% of program pairs in less than 0.5 sec!
For an app store, these running times are perfectly fine
TWO PENDING PATENTS 150
FAST IN A NUTSHELL
FAST is a domain-specific language for writing tree manipulating programs
FAST can model programs without approximation using symbolic tree transducers with regular lookahead
FAST was evaluated using real-world programs
WHAT’S NEXT 153
FOR EACH DOMAIN SPECIFIC TASK
Design a language that:
only has the features required by the task
is simple to use
enables automatic reasoning about what the programs do
compiles into efficient code
DREX EFFICIENT STRING MANIPULATION
Rajeev Alur, Loris D’Antoni, Mukund Raghothaman
Here at POPL15!
DECLARATIVE LANGUAGE FOR STRING SCRIPTS (15/1, 2PM, SEC. 2B)
Automaton for (a|b)*b; transducer with transitions a/a and b/b
iterate(choice(a->a, b->b))
Execute this code in a single linear-time left-to-right pass over the input string!!
BEX 2.0 PARALLEL EXECUTION OF STRING ENCODERS
Margus Veanes, Todd Mytkowicz, David Molnar, Ben Livshits
Here at POPL 15!!
FROM TRANSDUCERS TO PARALLEL EXECUTIONS (15/1, 2PM, SEC. 2B) Efficient data-parallel code x / [ r+x, x+1], r := 0 x / [ x+4 ], r := (x-2) 02
PROGRAM BOOSTING OR CROWD-SOURCING FOR CORRECTNESS
Robert Cochran, Loris D’Antoni, Benjamin Livshits, David Molnar, Margus Veanes
Here at POPL 15!!
CROWD-SOURCING PROGRAMS WITH AUTOMATA (17/1, 4PM, SEC. 9B) 160 Specification
YOU CAN HELP TOO! 161
INTERESTING DIRECTIONS
A transducer-based language for
– Web scrapers
– Spreadsheet transformations
– Compiler optimizations
– XML processing
– HTML rendering
SUMMARIZING… 163
OUR RECIPE FOR EACH TASK
DSL, Transformation, Transducer Model, Analysis (Z3, Microsoft.Automata): analysis question, does it do the right thing?, Code Gen (C#, JavaScript, C)
BEK
Fast and precise sanitizer analysis with BEK. Hooimeijer, Livshits, Molnar, Saxena, Veanes. USENIX11
Symbolic finite state transducers: algorithms and applications. Veanes, Hooimeijer, Livshits, Molnar, Bjorner. POPL12
BEX
Static analysis of string encoders and decoders. D’Antoni, Veanes. VMCAI13
Equivalence of extended symbolic finite transducers. D’Antoni, Veanes. CAV13
Data parallel string manipulating programs. Veanes, Mytkowicz, Molnar, Livshits. POPL15
FAST
Fast: a transducer based language for tree manipulation. D’Antoni, Veanes, Livshits, Molnar. PLDI14