Published by Sheryl Lawrence. Modified over 9 years ago.
1
PROGRAMMING USING AUTOMATA AND TRANSDUCERS Loris D’Antoni, Margus Veanes
2
2
3
3
4
4
5
5
6
6 All features of general purpose language Features needed replace, match, char…
7
FOR EACH DOMAIN-SPECIFIC TASK Design a language that has only the features required by the task: it is simple to use, enables automatic reasoning about what the programs do, and compiles into efficient code.
8
OUTLINE Automata, transducers, and programs BEK and string sanitizers BEX and string encoders FAST and tree manipulating programs What’s next? 8
9
AUTOMATA, TRANSDUCERS, AND PROGRAMS 9
10
FOR EACH DOMAIN-SPECIFIC TASK Design a language that has only the features required by the task: it is simple to use, enables automatic reasoning about what the programs do, and compiles into efficient code.
11
type base = A | T | C | G

let rec all_TG (l: base list) : bool =
  match l with
  | [] -> true
  | h :: t -> (h = T || h = G) && (all_TG t)

let rec all_AC (l: base list) : bool =
  match l with
  | [] -> true
  | h :: t -> (h = A || h = C) && (all_AC t)

let rec map_base (l: base list) : base list =
  match l with
  | [] -> []
  | A :: t -> T :: (map_base t)
  | T :: t -> A :: (map_base t)
  | G :: t -> C :: (map_base t)
  | C :: t -> G :: (map_base t)

let rec filter_AC (l: base list) : base list =
  match l with
  | [] -> []
  | A :: t -> A :: (filter_AC t)
  | T :: t -> filter_AC t
  | G :: t -> filter_AC t
  | C :: t -> C :: (filter_AC t)

Finite alphabet: base
Languages of strings: all_TG, all_AC
Transformations from strings to strings: map_base, filter_AC
[Figure: one-state automata all_TG (loops on T, G) and all_AC (loops on A, C); one-state transducers map_base (A/T, T/A, G/C, C/G) and filter_AC (A/A, T/ε, G/ε, C/C)]
12
FINITE AUTOMATA
[Figure: finite automaton with transitions labeled a and b]
abab: Yes
aba: No
bb: Yes
a: No
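The slide's diagram did not survive extraction, so the following is only a sketch under an assumption: a hypothetical two-state automaton over {a, b} that accepts exactly the words ending in b. That choice is consistent with the four Yes/No examples on the slide, but it is not necessarily the automaton that was drawn.

```python
# Hypothetical DFA (an assumption, not taken from the slide's figure):
# state 1 means "the last symbol read was b".
DELTA = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}
START = 0
ACCEPTING = {1}


def accepts(word):
    """Run the DFA on `word` and report whether it ends in an accepting state."""
    state = START
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in ACCEPTING
```

Calling `accepts` on the four words from the slide reproduces the Yes/No answers listed there.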
13
FINITE STATE TRANSDUCERS
[Figure: finite state transducer with transitions a/aa and b/bb and final output zz]
ab ↦ aabbzz
b ↦ bbzz
aba ↦ UNDEFINED
14
BENEFITS OF AUTOMATA AND TRANSDUCERS Closure and decidability for automata: Intersection, union, complement Decidable emptiness Decidable equivalence Can be minimized 14
15
BENEFITS OF AUTOMATA AND TRANSDUCERS
Transducer composition
let m_f_DNA l : base list = filter_AC (map_base l)
[Figure: map_base (A/T, T/A, G/C, C/G) composed with filter_AC (A/A, T/ε, G/ε, C/C) gives the single transducer m_f_DNA (A/ε, T/A, G/C, C/ε)]
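As a rough illustration of composition, the sketch below encodes the two one-state transducers as Python dictionaries and builds the composed transducer symbol by symbol. The dictionary encoding and the helper names `run` and `compose` are assumptions made for this example; they only cover single-state transducers over a finite alphabet.

```python
# Single-state transducers as dicts: input symbol -> output string ("" plays the role of ε).
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}


def run(transducer, word):
    """Apply a single-state transducer to a word, one symbol at a time."""
    return "".join(transducer[symbol] for symbol in word)


def compose(first, second):
    """Build the transducer that applies `first` and then `second`."""
    return {symbol: run(second, output) for symbol, output in first.items()}


M_F_DNA = compose(MAP_BASE, FILTER_AC)
```

The resulting `M_F_DNA` has the transitions A/ε, T/A, G/C, C/ε, so a word is transformed in one pass instead of two.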
16
BENEFITS OF AUTOMATA AND TRANSDUCERS Type-checking map_base o (¬ all_AC) 16 input in all_TG map_base output in all_AC map_base only defined if output in (¬ all_AC)
17
BENEFITS OF AUTOMATA AND TRANSDUCERS Type-checking dom(map_base o (¬ all_AC)) 17 input in all_TG map_base output in all_AC Inputs for which map_base does not output in all_AC
18
BENEFITS OF AUTOMATA AND TRANSDUCERS Type-checking dom(map_base o (¬ all_AC)) ∩ all_TG = ∅ 18 input in all_TG map_base output in all_AC
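For one-state machines the emptiness check dom(map_base o (¬ all_AC)) ∩ all_TG = ∅ reduces to a per-symbol test, which the sketch below spells out. The set-based encoding and the function names are assumptions for illustration; the general construction works on automata with many states.

```python
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
ALL_TG = {"T", "G"}  # symbols allowed by the input language all_TG
ALL_AC = {"A", "C"}  # symbols allowed by the output language all_AC


def bad_input_symbols(transducer, output_symbols):
    """Input symbols whose output leaves the output language.

    A word is in dom(T o ¬L) exactly when it contains one of these symbols.
    """
    return {
        symbol
        for symbol, output in transducer.items()
        if any(ch not in output_symbols for ch in output)
    }


def type_checks(transducer, input_symbols, output_symbols):
    """True when no allowed input symbol can produce a disallowed output."""
    return not (bad_input_symbols(transducer, output_symbols) & input_symbols)
```

Here `type_checks(MAP_BASE, ALL_TG, ALL_AC)` holds because the only bad input symbols are A and C, and neither occurs in all_TG.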
19
BENEFITS OF AUTOMATA AND TRANSDUCERS Transducer equivalence let m_f_DNA l : base list = filter_AC (map_base l) let f_m_DNA l : base list = map_base (filter_AC l) Is m_f_DNA equivalent to f_m_DNA ? 19
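The equivalence question can be explored with a bounded search for a differentiating input. This is only a sketch: a real equivalence check is a decision procedure on the transducers, while the hypothetical `find_counterexample` below just enumerates short words.

```python
from itertools import product

MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}


def run(transducer, word):
    return "".join(transducer[symbol] for symbol in word)


def m_f_dna(word):
    return run(FILTER_AC, run(MAP_BASE, word))


def f_m_dna(word):
    return run(MAP_BASE, run(FILTER_AC, word))


def find_counterexample(f, g, alphabet="ATGC", max_len=3):
    """Return the first word (shortest first) on which f and g differ, or None."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            word = "".join(letters)
            if f(word) != g(word):
                return word
    return None
```

The search stops at the word A: m_f_DNA maps it to the empty word, while f_m_DNA maps it to T, so the two programs are not equivalent.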
20
FOR EACH DOMAIN-SPECIFIC TASK Design a language that has only the features required by the task: it is simple to use, enables automatic reasoning about what the programs do, and compiles into efficient code.
21
OUTLINE Automata, transducers, and programs BEK and string sanitizers BEX and string encoders FAST and tree manipulating programs What’s next? 21
22
BEK ANALYSIS OF STRING SANITIZERS P. Hooimeijer, M. Veanes, B. Livshits, D. Molnar, P. Saxena [USENIX11, POPL12]
23
23
24
24
25
QUESTION: What could possibly go wrong?
26
26 Attacker: gollum.png' onload='javascript:...
27
27 Attacker: gollum.png' onload='javascript:... Result: <img src='gollum.png' onload='javascript:…
28
28 Attacker: im.png' onload='javascript:... Result: <img src='im.png' onload='javascri I found my PRECIOUSS S.
29
29
30
FIRST LINE OF DEFENSE: SANITIZERS Sanitizer: a string transformation function. PLDI'12 submission presentations 30 “im.png' …”“img.png' …” Sanitized dataUntrusted data Dec 8, 2011
31
COMPARING SANITIZERS 31
32
32 ' ' single quote html entity
33
33 some untrusted input
34
Library A
Name: HtmlEncode
Around for: Years
Availability: Readily available to C# developers
some untrusted input
35
Library A
Name: HtmlEncode
Around for: Years
Availability: Readily available to C# developers
Library B
Name: HtmlEncode
Around for: Years
Availability: Readily available to C# developers
some untrusted input
36
36 Library A Name: Around for: Availability: Library B Name: Around for: Availability: HtmlEncode Years Readily available to C# developers HtmlEncode Years Readily available to C# developers ' ' ' ' ✔ ✘
37
.NET WebUtility

public static string HtmlEncode(string s)
{
    if (s == null) return null;
    int num = IndexOfHtmlEncodingChars(s, 0);
    if (num == -1) return s;
    StringBuilder builder = new StringBuilder(s.Length + 5);
    int length = s.Length;
    int startIndex = 0;
Label_002A:
    if (num > startIndex)
    {
        builder.Append(s, startIndex, num - startIndex);
    }
    char ch = s[num];
    if (ch > '>')
    {
        builder.Append("&#");
        builder.Append(((int) ch).ToString(NumberFormatInfo.InvariantInfo));
        builder.Append(';');
    }
    else
    {
        char ch2 = ch;
        if (ch2 != '"')
        {
            switch (ch2)
            {
                case '<': builder.Append("&lt;"); goto Label_00D5;
                case '=': goto Label_00D5;
                case '>': builder.Append("&gt;"); goto Label_00D5;
                case '&': builder.Append("&amp;"); goto Label_00D5;
            }
        }
        else
        {
            builder.Append("&quot;");
        }
    }
Label_00D5:
    startIndex = num + 1;
    if (startIndex < length)
    {
        num = IndexOfHtmlEncodingChars(s, startIndex);
        if (num != -1) { goto Label_002A; }
        builder.Append(s, startIndex, length - startIndex);
    }
    return builder.ToString();
}

MS AntiXSS

private static string HtmlEncode(string input, bool useNamedEntities, MethodSpecificEncoder encoderTweak)
{
    if (string.IsNullOrEmpty(input)) { return input; }
    if (characterValues == null) { InitialiseSafeList(); }
    if (useNamedEntities && namedEntities == null) { InitialiseNamedEntityList(); }

    // Setup a new character array for output.
    char[] inputAsArray = input.ToCharArray();
    int outputLength = 0;
    int inputLength = inputAsArray.Length;
    char[] encodedInput = new char[inputLength * 10];

    SyncLock.EnterReadLock();
    try
    {
        for (int i = 0; i < inputLength; i++)
        {
            char currentCharacter = inputAsArray[i];
            int currentCodePoint = inputAsArray[i];
            char[] tweekedValue;

            // Check for invalid values
            if (currentCodePoint == 0xFFFE || currentCodePoint == 0xFFFF)
            {
                throw new InvalidUnicodeValueException(currentCodePoint);
            }
            else if (char.IsHighSurrogate(currentCharacter))
            {
                if (i + 1 == inputLength)
                {
                    throw new InvalidSurrogatePairException(currentCharacter, '\0');
                }

                // Now peek ahead and check if the following character is a low surrogate.
                char nextCharacter = inputAsArray[i + 1];
                char nextCodePoint = inputAsArray[i + 1];
                if (!char.IsLowSurrogate(nextCharacter))
                {
                    throw new InvalidSurrogatePairException(currentCharacter, nextCharacter);
                }

                // Look-ahead was good, so skip.
                i++;

                // Calculate the combined code point
                long combinedCodePoint = 0x10000 + ((currentCodePoint - 0xD800) * 0x400) + (nextCodePoint - 0xDC00);
                char[] encodedCharacter = SafeList.HashThenValueGenerator(combinedCodePoint);
                encodedInput[outputLength++] = '&';
                for (int j = 0; j < encodedCharacter.Length; j++)
                {
                    encodedInput[outputLength++] = encodedCharacter[j];
                }
                encodedInput[outputLength++] = ';';
            }
            else if (char.IsLowSurrogate(currentCharacter))
            {
                throw new InvalidSurrogatePairException('\0', currentCharacter);
            }
            else if (encoderTweak != null && encoderTweak(currentCharacter, out tweekedValue))
            {
                for (int j = 0; j < tweekedValue.Length; j++)
                {
                    encodedInput[outputLength++] = tweekedValue[j];
                }
            }
            else if (useNamedEntities && namedEntities[currentCodePoint] != null)
            {
                char[] encodedCharacter = namedEntities[currentCodePoint];
                encodedInput[outputLength++] = '&';
                for (int j = 0; j < encodedCharacter.Length; j++)
                {
                    encodedInput[outputLength++] = encodedCharacter[j];
                }
                encodedInput[outputLength++] = ';';
            }
            else if (characterValues[currentCodePoint] != null)
            {
                // character needs to be encoded
                char[] encodedCharacter = characterValues[currentCodePoint];
                encodedInput[outputLength++] = '&';
                for (int j = 0; j < encodedCharacter.Length; j++)
                {
                    encodedInput[outputLength++] = encodedCharacter[j];
                }
                encodedInput[outputLength++] = ';';
            }
            else
            {
                // character does not need encoding
                encodedInput[outputLength++] = currentCharacter;
            }
        }
    }
    finally
    {
        SyncLock.ExitReadLock();
    }

    return new string(encodedInput, 0, outputLength);
}
38
[The same two HtmlEncode implementations as on the previous slide: .NET WebUtility and MS AntiXSS]
Same behavior on all inputs?
If not, what is a differentiating input?
Can it generate any known ‘bad’ outputs?
39
39 PHP Trunk Changes to html.c, 1999--2011
40
40 PHP Trunk Changes to html.c, 1999—2011 R7,841 April 1999 135 loc R309,482 March 2011 1693 loc
41
41 PHP Trunk Changes to html.c, 1999—2011 R32,564 September 2000 ENT_QUOTES introduced R7,841 April 1999 135 loc R309,482 March 2011 1693 loc
42
42 PHP Trunk Changes to html.c, 1999—2011 R32,564 September 2000 ENT_QUOTES introduced R242,949 September 2007 $double_encode=true R7,841 April 1999 135 loc R309,482 March 2011 1693 loc
43
43 PHP Trunk Changes to html.c, 1999—2011 Safe to apply twice? Safe to combine with other sanitizers?
44
MOTIVATION 44 Writing string sanitizers correctly is difficult There is no cheap way to identify problems with sanitizers ‘Correctness’ is a moving target What if we could say more about sanitizer behavior?
45
CONTRIBUTIONS
BEK
Frontend: a small language for string manipulation; similar to how sanitizers are written today
Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
46
CONTRIBUTIONS
BEK
Frontend: a small language for string manipulation; similar to how sanitizers are written today
Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
Evaluation
Converted sanitizers from a variety of sources
Checked properties like reversibility, idempotence, equivalence, and commutativity
47
47 s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program BEK ARCHITECTURE
48
48 Symbolic Finite Transducers Z3 Transformation Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program BEK ARCHITECTURE
49
49 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program BEK ARCHITECTURE
50
50 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE
51
51 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE
52
52 escape := iter(c in s)[b := false;] { case (!b && c in "['\"]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; A BEK PROGRAM: ESCAPE QUOTES
53
53 escape := iter(c in s)[b := false;] { case (!b && c in "['\"]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; A BEK PROGRAM: ESCAPE QUOTES iterate over the characters in string s
54
A BEK PROGRAM: ESCAPE QUOTES
escape := iter(c in s)[b := false;] {
  case (!b && c in "['\"]"): b := false; yield('\\', c);
  case (c == '\\'): b := !b; yield(c);
  case (true): b := false; yield(c);
};
Simple dedicated syntax: iterate over the characters in string s while updating one boolean variable b
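To make the semantics of the escape program concrete, here is a direct Python transliteration. It is a sketch for reading purposes only, not output of the BEK code generator; the boolean `b` tracks whether the previous character was an unmatched backslash.

```python
def escape(s):
    """Escape unescaped single and double quotes, mirroring the Bek program."""
    out = []
    b = False  # True when the previous character was an unmatched backslash
    for c in s:
        if not b and c in "'\"":
            b = False
            out.append("\\" + c)  # yield('\\', c)
        elif c == "\\":
            b = not b
            out.append(c)
        else:
            b = False
            out.append(c)
    return "".join(out)
```

Because already-escaped quotes are left alone, applying `escape` twice to these sample inputs gives the same result as applying it once, which is the kind of idempotence property the analysis is meant to check for all inputs.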
55
55 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE
56
FINITE STATE TRANSDUCERS
Transitions: a/A, b/B, …, z/Z, …, &/&
Problem: alphabet has 2^16 characters
TOO MANY TRANSITIONS
57
SYMBOLIC FINITE TRANSDUCERS 57 Only two transitions!! x in [a-z] / x-32 x not in [a-z] / x
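The two symbolic transitions can be sketched in Python as guard predicates paired with output functions over code points. The list-of-pairs representation and the name `run_sft` are assumptions for this example; the actual tool represents guards as formulas that a solver can reason about.

```python
# Each transition: (guard on the code point x, list of output functions of x).
TO_UPPER = [
    (lambda x: ord("a") <= x <= ord("z"), [lambda x: x - 32]),
    (lambda x: not (ord("a") <= x <= ord("z")), [lambda x: x]),
]


def run_sft(transitions, text):
    """Run a one-state symbolic transducer over `text`."""
    out = []
    for ch in text:
        x = ord(ch)
        for guard, outputs in transitions:
            if guard(x):
                out.extend(chr(f(x)) for f in outputs)
                break
        else:
            raise ValueError(f"no transition applies to {ch!r}")
    return "".join(out)
```

Two transitions cover the whole 16-bit alphabet, where an explicit transducer would need one transition per character.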
58
SYMBOLIC FINITE TRANSDUCERS
Transitions: x>5 / [x+1, x]; x%2=1 / [x-1, x, x+4]; true / [5]; true / [x-4]
Guards are predicates; outputs are sequences of functions
Alphabet theory has to be DECIDABLE
We’ll use Z3 to check predicate satisfiability
59
59 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE
60
60 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen Now what? BEK ARCHITECTURE
61
SFT Algorithms 61 EQUIVALENCE CHECKING IS DECIDABLE! Alphabet theory has to be DECIDABLE We’ll use Z3 to check predicate satisfiability
62
SFT Algorithms 62 AntiXSS.HtmlEncode = WebUtility.HtmlEncode EQUIVALENCE CHECKING
63
63 SFT A B inout SFT A inout SFT B CLOSED UNDER COMPOSITION
64
SFT Algorithms 64 SFT A B inout SFT A inout SFT B JavaScriptEncode(HtmlEncode(w)) = HtmlEncode(JavaScriptEncode(w)) COMPOSITION
65
65 PRE-IMAGE COMPUTATION Regular Language O Regular Language I outin SFT A
66
66 PRE-IMAGE COMPUTATION MALICIOUS INPUTS Vulnerability signature outin SFT A
67
CONTRIBUTIONS
BEK
Frontend: a small language for string manipulation; similar to how sanitizers are written today
Backend: a model based on symbolic finite transducers with algorithms for analysis and code generation
Evaluation
Converted sanitizers from a variety of sources
Checked properties like reversibility, idempotence, equivalence, and commutativity
68
QUESTIONS
Can BEK model existing sanitizers?
Can we use it to check interesting properties on real sanitizers?
69
WHAT FEATURES ARE NEEDED?
Language Features
Data (inspect feature counts):
1x OWASP HTMLencode
13x Google AutoEscape
21x IE 8 XSS Filter
7x Synthetic
70
Language Features 70 Majority (76%) of sanitizers can be ported without extending the language With multi-character lookahead: 90% WHAT FEATURES ARE NEEDED?
71
CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Data:
4x MS internal HtmlEncode
3x ‘for hire’ HtmlEncode based on English-language specification (C#)
Commutative? Equivalent?
72
72 Short answer: Yes! CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
73
CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?
Short answer: Yes!
EQ results take less than a minute to obtain:
    1 2 3 4 5 6 7
1   ✔ ✔ ✔ ✘ ✘ ✔ ✘
2     ✔ ✔ ✘ ✘ ✔ ✘
3       ✔ ✘ ✘ ✔ ✘
4         ✔ ✘ ✘ ✘
5           ✔ ✘ ✘
6             ✔ ✘
7               ✔
74
DOES IT SCALE?
[Figure: two panels, Commutativity and Self-Equivalence]
75
The Cheat Sheet 75 One out of seven implementations correctly encodes all strings for use in both HTML and attribute contexts WERE ALL SANITIZERS BROKEN?
76
BEK IN A NUTSHELL
Conclusion
BEK is a domain-specific language for writing string sanitizers
BEK can model programs without approximation using symbolic finite transducers, enabling e.g., equivalence checks
BEK was evaluated using real-world sanitizers from a variety of different sources
77
OUTLINE Automata, transducers, and programs BEK and string sanitizers BEX and string encoders FAST and tree manipulating programs What’s next? 77
78
BEX ANALYSIS OF STRING ENCODERS Loris D’Antoni, Margus Veanes [VMCAI13, CAV13]
79
79 Hi, I’m plain text! Nice to meet you! SGkgSSdtIHBsYWluI HRleHQsIG5pY2Ugd G8gbWVldCB5b3Uh Encoder Decoder
80
NOT SO EASY TO GET RIGHT 80
81
WHEN ARE THEY CORRECT? 81 T Encoder T’ Decoder T Encoder T’T
82
CAN WE USE TRANSDUCERS? 82 T Encoder T’ Decoder T Encoder o Decoder = Identity
83
Language Features 83 Majority (76%) of sanitizers can be ported without extending Bek With multi-character lookahead: 90% BEK: WHAT FEATURES WERE NEEDED?
84
BASE64 encoder
3 Bytes to 4 Base64 characters
Text content:   M        a        n
Bytes:          77       97       110
Bit pattern:    010011 010110 000101 101110
Index:          19     22     5      46
Base64 encoded: T      W      F      u
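The table's arithmetic can be reproduced with a few shifts and masks. The sketch below encodes one block of three bytes into four 6-bit indices and looks them up in the standard Base64 alphabet; the function name `encode_block` is made up for this example and padding is not handled.

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_block(b1, b2, b3):
    """Split three bytes (24 bits) into four 6-bit indices and map them to characters."""
    indices = [
        b1 >> 2,
        ((b1 & 0x3) << 4) | (b2 >> 4),
        ((b2 & 0xF) << 2) | (b3 >> 6),
        b3 & 0x3F,
    ]
    return indices, "".join(ALPHABET[i] for i in indices)
```

For the bytes 77, 97, 110 of the text Man this yields the indices 19, 22, 5, 46 and the characters TWFu, matching the table.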
85
85 HOW DO WE EXTEND BEK?
86
86 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE Symbolic finite transducers don’t have registers
87
TRANSDUCERS WITH REGISTERS
State 0 to state 1: x / [ x>>2 ], r := (x&3)<<4
State 1 to state 2: x / [ r|(x>>4) ], r := (x&0xF)<<2
State 2 to state 0: x / [ r|(x>>6), x&0x3F ], r := 0
Transducers with registers are closed under composition
Equivalent to Turing Machines
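A three-state transducer with one register can be simulated directly. The sketch below assumes the three transitions form a cycle 0, 1, 2, 0 in the order needed for Base64 (the diagram's layout was lost in extraction) and emits 6-bit indices rather than characters; padding is not modeled.

```python
def base64_indices(data):
    """Simulate the register transducer on a bytes object, returning 6-bit indices."""
    state, r, out = 0, 0, []
    for x in data:
        if state == 0:
            out.append(x >> 2)
            r = (x & 0x3) << 4
            state = 1
        elif state == 1:
            out.append(r | (x >> 4))
            r = (x & 0xF) << 2
            state = 2
        else:
            out.extend([r | (x >> 6), x & 0x3F])
            r = 0
            state = 0
    if state != 0:
        raise ValueError("input length must be a multiple of 3 (padding is not modeled)")
    return out
```

The register carries the leftover bits of one byte into the next transition, which is exactly what a plain symbolic finite transducer cannot express.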
88
EXPLORE REGISTERS VALUES
Register has finitely many values: remember last value
2^|bits| states
89
89 BASE64 IN BEX DEMO
90
90
91
91 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE
92
92 ? Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bex Program Code Gen C#JavaScriptC Code Gen BEX ARCHITECTURE
93
EXTENDED SYMBOLIC FINITE TRANSDUCERS
Input: Man… read as x1 x2 x3 …
Transition from p to q with lookahead 3:
x1 ≤ 0xFF ∧ x2 ≤ 0xFF ∧ x3 ≤ 0xFF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
94
EXTENDED SYMBOLIC FINITE TRANSDUCERS
Input: Man… read as x1 x2 x3 …
Output: TWFu…
Transition from p to q with lookahead 3:
x1 ≤ 0xFF ∧ x2 ≤ 0xFF ∧ x3 ≤ 0xFF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]
95
MORE EXPRESSIVE THAN SYMBOLIC FINITE TRANSDUCERS
Example transition: x1 > x2 / [x1 + x2]
Do they still have nice properties?
96
WHAT DO WE NEED? 96 T Encoder T’ Decoder T Encoder o Decoder = Identity CompositionEquivalence
97
NEGATIVE RESULTS
ESFAs:
– equivalence is undecidable
– are not closed under intersection
– are not closed under complement
ESFTs:
– equivalence is undecidable
– are not closed under composition
98
A FRIENDLIER RESTRICTION 98
99
CARTESIAN EXTENDED SYMBOLIC FINITE TRANSDUCERS
Negative results use binary predicates and encoders do not use this feature
Only allow conjunctions of unary predicates
Binary predicate: x1 < x2 + 1
Conjunction of unary predicates: x1 > 5 ∧ x2 = 1 / [x1 + x2, x1]
100
CARTESIAN ESFA = SFA
Cartesian ESFAs are now equivalent to SFAs
[Figure: the Cartesian ESFA transition x1 > 5 ∧ x2 = 1 next to an SFA with transitions x > 5 and x = 1]
101
STILL MORE EXPRESSIVE THAN SFTS
Cartesian ESFTs are strictly more expressive than SFTs!!
Example transition: x1 > 5 ∧ x2 = 1 / [x1 + x2]
102
WHAT DO WE NEED? 102 T Encoder T’ Decoder T Encoder o Decoder = Identity CompositionEquivalence
103
RESULTS 103 Cartesian ESFTs – equivalence is decidable – are not closed under composition
104
COMPOSITION IN PRACTICE 104
105
105 BEK WITH REGISTERS?
106
TRANSDUCERS WITH REGISTERS
State 0 to state 1: x / [ x>>2 ], r := (x&3)<<4
State 1 to state 2: x / [ r|(x>>4) ], r := (x&0xF)<<2
State 2 to state 0: x / [ r|(x>>6), x&0x3F ], r := 0
Transducers with registers are closed under composition
Equivalent to Turing Machines
107
COMPOSING CARTESIAN ESFTS 107 A Cartesian ESFTs A’B’ B Transducers with registers A’ o B’ A o B Cartesian ESFT ?
108
REGISTER ELIMINATION
Transducer with registers:
First transition: x / [ x+4 ], r := (x-2)
Second transition: x / [ r+x, x+1 ], r := 0
Equivalent ESFT transition with lookahead 2:
[x1, x2] / [ x1+4, x1-2+x2, x2+1 ], r := 0
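Register elimination on this example amounts to substituting the register's value into the next transition's output. The sketch below runs the two register transitions in sequence and compares the result with the single lookahead-2 rule; both function names are hypothetical, and the order of the two transitions is an assumption inferred from the combined rule.

```python
def with_register(x1, x2):
    """Read x1, then x2, threading the register r between the two transitions."""
    out = [x1 + 4]
    r = x1 - 2
    out.extend([r + x2, x2 + 1])
    r = 0  # register reset, as in the second transition
    return out


def esft_rule(x1, x2):
    """The single ESFT transition that reads both symbols at once."""
    return [x1 + 4, x1 - 2 + x2, x2 + 1]
```

Since r is always x1 - 2 when the second transition fires, the two definitions agree on every pair of inputs, and the register is no longer needed.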
109
DOES IT WORK? 109
110
UNICODE
UTF8 to UTF16 encoder (E) and decoder (D)
Test               Running Time
Dom(E) = UTF16     47 ms
Dom(EoD) = UTF16   109 ms
Dom(D) = UTF8      156 ms
Dom(DoE) = UTF8    320 ms
EoD = Identity     16 ms
DoE = Identity     24 ms
Complete analysis in about a second
111
BASE64
Base64 encoder (E) and decoder (D)
Test               Running Time
Dom(E) = bytes     13 ms
Dom(EoD) = bytes   55 ms
Dom(D) = 6bits+    76 ms
Dom(DoE) = 6bits+  56 ms
EoD = Identity     53 ms
DoE = Identity     19 ms
112
112 Cartesian Extended Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? EoD=I Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bex Program Code Gen C#JavaScriptC Code Gen BEX ARCHITECTURE
113
BEX IN A NUTSHELL
Conclusion
BEX is a domain-specific language for writing string encoders
BEX can model programs without approximation using Cartesian extended symbolic finite transducers
BEX was evaluated using real-world string encoders
114
OUTLINE Automata, transducers, and programs BEK and string sanitizers BEX and string encoders FAST and tree manipulating programs What’s next? 114
115
FAST ANALYSIS OF PROGRAMS MANIPULATING TREES Loris D’Antoni, Margus Veanes, Ben Livshits, David Molnar [PLDI14]
116
116
117
SOLUTION: USE AN HTML SANITIZER Remove malicious active code from HTML documents SANITIZE 117 alert(“This is Sparta!”); I swear this HTML is safe! I swear this HTML is safe!
118
TYPICAL TRANSFORMATIONS
Typical transformations:
– Remove scripts
– Remove malicious URLs
– Replace deprecated tags
Interesting questions, given a sanitizer S:
– Does S always produce a safe and well-formed output?
– Is S defined on every possible HTML file?
– Does executing S twice produce the same output as executing S once?
– Can we execute S fast?
119
HOW DO WE WRITE ONE? 119 DEMODEMO: http://rise4fun.com/Fast/2 1
120
120
121
121
122
122
123
123
124
124
125
KEY IDEA: HTML CODE IS A TREE body script malicious code div p I swear this HTML is safe! 125 SANITIZE body div p I swear this HTML is safe!
126
MOTIVATION Trees are common input/output data structures – XML query, type-checking, etc… – Compilers/optimizers (from parse tree to parse tree) – Tree manipulating programs: data structures algorithms, ontologies, etc… 126
127
127 ? Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Fast Program Code Gen C#JavaScriptC Code Gen FAST ARCHITECTURE
128
CHOOSING THE RIGHT FORMALISM 128
129
SEMANTICS AS TRANSDUCERS Goal: find a class of tree transducers that can express the previous examples and is closed under composition 129
130
TOP DOWN TREE TRANSDUCERS [ENGELFRIET75]
Rule: q(a(x1, x2)) → b(c, q1(x1))
Decidable properties: type-checking, etc…
Domain expressiveness: only finite alphabets
131
SYMBOLIC TREE TRANSDUCERS [PSI11]
Rule: q(λa.a>3, (x1, x2)) → λa.a+1, (λa.a-2, q1(x1))
Example: a node labeled 5 becomes a node labeled 5+1 with a child labeled 5-2, since 5>3 is true
Decidable properties: type-checking, etc…
Domain expressiveness: infinite alphabets using predicates and functions
Structural expressiveness: can’t delete a node without reading it first
Alphabet theory has to be DECIDABLE
We’ll use Z3 to check predicate satisfiability
132
IMPROVING STRUCTURAL EXPRESSIVENESS Transformation: delete the left child if it contains a script If we delete the node we can’t check that the left child contained a script divq q 132 Regular Look-Ahead (RLA) ??
133
REGULAR LOOK AHEAD
Transformation: delete the left child if it contains a script
Rules can ask whether the children are in particular languages
– p1: the language of trees that contain a script node
– p2: the language of all trees
Decidable properties: type-checking, etc…
Domain expressiveness: infinite alphabets
Structural expressiveness: good enough to express our examples
[Figure: rule q on a div node whose children are checked against p1 and p2]
Transformation now is safe
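The look-ahead rule can be imitated with ordinary recursion. In the sketch below a tree is a `(label, children)` pair, `contains_script` plays the role of the look-ahead language p1, and `sanitize` deletes the left child of a two-child div when the look-ahead says it contains a script. The tuple encoding and both names are assumptions for illustration, not FAST syntax.

```python
def contains_script(tree):
    """Look-ahead language p1: trees that contain a script node."""
    label, children = tree
    return label == "script" or any(contains_script(child) for child in children)


def sanitize(tree):
    """Delete the left child of a div if that child contains a script."""
    label, children = tree
    if label == "div" and len(children) == 2 and contains_script(children[0]):
        # The left child is inspected by the look-ahead, then dropped without being transformed.
        return (label, [sanitize(children[1])])
    return (label, [sanitize(child) for child in children])
```

The point of the look-ahead is visible in the first branch: the decision depends on the deleted subtree, yet the rule never has to produce output for it.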
134
Model                                                          | Decidability | Complexity | Structural expressiveness | Infinite alphabets
Top Down Tree Transducers [Engelfriet75]                       | V            | V          | X                         | X
Top Down Tree Transducers with Regular Look-ahead [Engelfriet76] | V          | V          | ~                         | X
Streaming Tree Transducers [AlurDantoni12]                     | V            | X          | V                         | X
Data Automata [Bojanczyk98]                                    | ~            | X          | X                         | V
Symbolic Tree Transducers [VeanesBjoerner11]                   | V            | V          | X                         | V
Symbolic Tree Transducers RLA                                  | V            | V          | ~                         | V
135
COMPOSITION OF STTR
Given T1 and T2, build T1 o T2
This is not always possible!!
Find the biggest class for which it is possible
136
WHEN CAN WE COMPOSE?
Theorem: T(x) = T2(T1(x)) is definable by a Symbolic Tree Transducer with RLA if
– T1 is deterministic
All our examples fall in this category
137
137 Symbolic Tree Transducers with RLA Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Fast Program Code Gen C#JavaScriptC Code Gen FAST ARCHITECTURE
138
CASE STUDIES AND EXPERIMENTS 138
139
CASE STUDIES AND EXPERIMENTS Program Optimization: Deforestation of functional programs Verification: HTML sanitization Analysis of functional programs Augmented reality app store 139 Infinite Alphabets: Integer Data types
140
DEFORESTATION
Removing intermediate data structures from programs
alphabet IList [i : int] { nil(0), cons(1) }
trans mapC: IList -> IList {
  nil() to nil [0]
| cons(x) to cons [(i+5)%26] (mapC x)
}
def mapC2: IList -> IList := compose mapC mapC
ADVANTAGE: the program is a single transducer that reads the input list only once, thanks to transducer composition
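The effect of composing mapC with itself can be sketched on plain Python lists, which stand in for the IList trees here as a simplification. `map_c` follows the slide's rule, and the hypothetical `map_c2_fused` is what a single composed pass would compute.

```python
def map_c(values):
    """One pass of mapC: shift every element by 5 modulo 26."""
    return [(i + 5) % 26 for i in values]


def map_c2_fused(values):
    """mapC composed with itself, computed in a single pass over the input."""
    return [((i + 5) % 26 + 5) % 26 for i in values]
```

Calling `map_c(map_c(xs))` builds an intermediate list and traverses the data twice, while `map_c2_fused(xs)` produces the same result in one traversal, which is the saving deforestation is after.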
141
STAGES BY EXAMPLE 141 mapCmapC2 Transducers
142
DEFORESTATION: SPEEDUP
f(f(f(…f(x)…))) vs. (f;f;f;…;f)(x)
143
ANALYSIS OF FUNCTIONAL PROGRAMS 143
144
AR INTERFERENCE ANALYSIS Recognizers output data that can be seen as a tree structure Spine Hip Neck HeadKnee Ankle Foot …. 144
145
APPS AS TREE TRANSFORMATIONS Applications that use recognizers can be modeled as FAST programs 145 trans addHat: STree -> STree Spine(x,y) to Spine(addHat(x), y) | Neck(h,l,r) to Neck(addHat(h), l, r) | Head(a) to Head(Hat(a))
146
COMPOSITION OF PROGRAMS
Two FAST programs p1 and p2 can be composed into a single FAST program p1;p2
147
ANOTHER RECOGNIZER 147 Room Floor Wall Table Chair …. Chair ….
148
INTERFERENCE ANALYSIS
Apps can be malicious: try to overwrite outputs of other apps
Apps interfere when they annotate the same node of a recognizer’s output
We can compose them and check if they interfere statically!!
– Put checker in the AppStore and analyze Apps before approval
Interfering apps:
– Add cat ears / Add hat
– Add pin to a city / Blur a city
– Amazon Buy Now button / Malicious Buy Now button
149
INTERFERENCE ANALYSIS IN PRACTICE
100 generated FAST programs, up to 85 functions each
Check statically if they conflict pairwise for ANY possible input
Checked 99% of program pairs in less than 0.5 sec!
For an app store these running times are perfectly fine
150
TWO PENDING PATENTS 150
151
FAST IN A NUTSHELL
Conclusion
FAST is a domain-specific language for writing tree manipulating programs
FAST can model programs without approximation using symbolic tree transducers with regular lookahead
FAST was evaluated using real-world programs
152
OUTLINE Automata, transducers, and programs BEK and string sanitizers BEX and string encoders FAST and tree manipulating programs What’s next? 152
153
WHAT’S NEXT 153
154
FOR EACH DOMAIN-SPECIFIC TASK Design a language that has only the features required by the task: it is simple to use, enables automatic reasoning about what the programs do, and compiles into efficient code.
155
DREX EFFICIENT STRING MANIPULATION Loris D’Antoni, Mukund Raghothaman, Rajeev Alur Here at POPL15!
156
DECLARATIVE LANGUAGE FOR STRING SCRIPTS (15/1, 2PM, SEC. 2B)
[Figure: automaton for (a|b)*b and transducer with transitions a/a and b/b]
(a|b)*b
iterate(choice(a->a, b->b))
Execute this code in a linear-time, left-to-right pass on the input string!!
157
BEX 2.0 PARALLEL EXECUTION OF STRING ENCODERS Margus Veanes, David Molnar, Ben Livshits, Todd Mytkowicz Here at POPL 15!!
158
FROM TRANSDUCERS TO PARALLEL EXECUTIONS (15/1, 2PM, SEC. 2B) Efficient data-parallel code 158 12 x / [ r+x, x+1], r := 0 x / [ x+4 ], r := (x-2) 02
159
PROGRAM BOOSTING OR CROWD-SOURCING FOR CORRECTNESS Loris D’Antoni, David Molnar, Benjamin Livshits, Margus Veanes, Robert Cochran Here at POPL 15!!
160
CROWD-SOURCING PROGRAMS WITH AUTOMATA (17/1, 4PM, SEC. 9B) 160 Specification
161
YOU CAN HELP TOO! 161
162
INTERESTING DIRECTIONS
A transducer-based language for
– Web scrapers
– Spreadsheet transformations
– Compiler optimizations
– XML processing
– HTML rendering
163
SUMMARIZING… 163
164
164 Transducer Model Z3 Transformation Analysis Does it do the right thing? Analysis question Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; DSL Code Gen C#JavaScriptC Code Gen OUR RECIPE FOR EACH TASK
165
BEK
Fast and precise sanitizer analysis with BEK. Hooimeijer, Livshits, Molnar, Saxena, Veanes, USENIX11
Symbolic finite state transducers: algorithms and applications. Veanes, Hooimeijer, Livshits, Molnar, Bjorner, POPL12
BEX
Static analysis of string encoders and decoders. D’Antoni, Veanes, VMCAI13
Equivalence of extended symbolic finite transducers. D’Antoni, Veanes, CAV13
Data parallel string manipulating programs. Veanes, Mytkowicz, Molnar, Livshits, POPL15
FAST
Fast: a transducer based language for tree manipulation. D’Antoni, Veanes, Livshits, Molnar, PLDI14