PROGRAMMING USING AUTOMATA AND TRANSDUCERS. Loris D’Antoni, Margus Veanes.

1 PROGRAMMING USING AUTOMATA AND TRANSDUCERS. Loris D’Antoni, Margus Veanes

6 6 All features of general purpose language Features needed replace, match, char…

7 FOR EACH DOMAIN-SPECIFIC TASK Design a language that: has only the features required by the task; is simple to use; enables automatic reasoning about what its programs do; compiles into efficient code.

8 OUTLINE Automata, transducers, and programs BEK and string sanitizers BEX and string encoders FAST and tree manipulating programs What’s next? 8

9 AUTOMATA, TRANSDUCERS, AND PROGRAMS 9

11 Finite alphabet; languages of strings; transformations from strings to strings

type base = A | T | C | G

(* Language: lists that contain only T and G *)
let rec all_TG (l : base list) : bool =
  match l with
  | [] -> true
  | h :: t -> (h = T || h = G) && all_TG t

(* Language: lists that contain only A and C *)
let rec all_AC (l : base list) : bool =
  match l with
  | [] -> true
  | h :: t -> (h = A || h = C) && all_AC t

(* Transformation: swap A with T and G with C *)
let rec map_base (l : base list) : base list =
  match l with
  | [] -> []
  | A :: t -> T :: map_base t
  | T :: t -> A :: map_base t
  | G :: t -> C :: map_base t
  | C :: t -> G :: map_base t

(* Transformation: keep only A and C *)
let rec filter_AC (l : base list) : base list =
  match l with
  | [] -> []
  | A :: t -> A :: filter_AC t
  | T :: t -> filter_AC t
  | G :: t -> filter_AC t
  | C :: t -> C :: filter_AC t

Automata (one state q0 each): all_TG loops on T and G; all_AC loops on A and C.
Transducers (one state each): map_base has transitions A/T, T/A, G/C, C/G; filter_AC has transitions A/A, T/ε, G/ε, C/C.
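The correspondence between the functions and the one-state machines on this slide can be sketched in a few lines of Python. The encoding used here (a set of allowed symbols for an automaton, a symbol-to-output table for a transducer) is an assumption made for illustration and is not taken from the talk.

```python
# Sketch: the slide's functions as one-state machines over {A, T, C, G}.
# The set/dict encoding is an illustrative assumption.

def accepts(allowed, word):
    """One-state automaton: accept iff every symbol has a self-loop."""
    return all(sym in allowed for sym in word)

def transduce(table, word):
    """One-state transducer: replace each symbol by its output string."""
    out = []
    for sym in word:
        out.append(table[sym])  # table[sym] may be "" (the ε output)
    return "".join(out)

ALL_TG = {"T", "G"}
ALL_AC = {"A", "C"}
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}

print(accepts(ALL_TG, "TGGT"))       # True
print(transduce(MAP_BASE, "ATGC"))   # TACG
print(transduce(FILTER_AC, "ATGC"))  # AC
```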

12 FINITE AUTOMATA [Figure: automaton with transitions labeled a and b]
Input   Accepted?
abab    Yes
aba     No
bb      Yes
a       No

13 FINITE STATE TRANSDUCERS 13 a/aa b/bb zz a/aa b/bb abaabbzz bbbzz abaUNDEFINED a

14 BENEFITS OF AUTOMATA AND TRANSDUCERS Closure and decidability for automata: Intersection, union, complement Decidable emptiness Decidable equivalence Can be minimized 14

15 BENEFITS OF AUTOMATA AND TRANSDUCERS Transducer composition
let m_f_DNA l : base list = filter_AC (map_base l)
map_base: A/T, T/A, G/C, C/G
filter_AC: A/A, T/ε, G/ε, C/C
m_f_DNA (the composed transducer): A/ε, T/A, G/C, C/ε
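For one-state transducers, composition amounts to pushing each output of the first machine through the second. The sketch below assumes the same hypothetical table encoding (symbol to output string); general transducer composition works on a product of state sets, which this simplification does not show.

```python
# Sketch: composing one-state transducers given as symbol -> output tables.
# The table encoding is an illustrative assumption.

def compose(first, second):
    """Return the table of the transducer that applies `first`, then `second`."""
    return {sym: "".join(second[o] for o in out) for sym, out in first.items()}

MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}

# m_f_DNA l = filter_AC (map_base l)
M_F_DNA = compose(MAP_BASE, FILTER_AC)
print(M_F_DNA)  # {'A': '', 'T': 'A', 'G': 'C', 'C': ''}
```

The resulting table matches the composed transducer on the slide: A/ε, T/A, G/C, C/ε.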

16 BENEFITS OF AUTOMATA AND TRANSDUCERS Type-checking map_base o (¬ all_AC) 16 input in all_TG map_base output in all_AC map_base only defined if output in (¬ all_AC)

17 BENEFITS OF AUTOMATA AND TRANSDUCERS Type-checking dom(map_base o (¬ all_AC)) 17 input in all_TG map_base output in all_AC Inputs for which map_base does not output in all_AC

18 BENEFITS OF AUTOMATA AND TRANSDUCERS Type-checking dom(map_base o (¬ all_AC)) ∩ all_TG = ∅ 18 input in all_TG map_base output in all_AC
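The emptiness check dom(map_base o (¬ all_AC)) ∩ all_TG = ∅ can be sketched for the one-state machines of the running example. The helper names and the set/table encoding are assumptions; a real implementation would use automata intersection and emptiness rather than this symbol-level shortcut, which is only valid for one-state machines.

```python
# Sketch: type-checking a one-state transducer against one-state automata.
# Encoding and helper names are illustrative assumptions.

ALL_TG = {"T", "G"}
ALL_AC = {"A", "C"}
MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}

def bad_symbols(table, output_lang):
    """Symbols whose output leaves output_lang. A word lies in
    dom(table o (not output_lang)) iff it contains one of these symbols."""
    return {s for s, out in table.items() if any(o not in output_lang for o in out)}

def type_checks(input_lang, table, output_lang):
    """True iff dom(table o (not output_lang)) ∩ input_lang is empty."""
    return not (bad_symbols(table, output_lang) & input_lang)

print(type_checks(ALL_TG, MAP_BASE, ALL_AC))  # True
print(type_checks(ALL_AC, MAP_BASE, ALL_AC))  # False
```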

19 BENEFITS OF AUTOMATA AND TRANSDUCERS Transducer equivalence let m_f_DNA l : base list = filter_AC (map_base l) let f_m_DNA l : base list = map_base (filter_AC l) Is m_f_DNA equivalent to f_m_DNA ? 19
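For the two compositions in question, a differentiating input can be found mechanically. The sketch below again assumes the one-state table encoding; for such total one-state transducers, two machines agree on all words exactly when they agree on every single symbol, so comparing tables suffices. This is a shortcut for the example, not the general equivalence algorithm.

```python
# Sketch: are m_f_DNA and f_m_DNA equivalent? (one-state table encoding assumed)

def compose(first, second):
    """Table of the transducer that applies `first`, then `second`."""
    return {sym: "".join(second[o] for o in out) for sym, out in first.items()}

def differentiating_input(t1, t2):
    """Return a one-symbol input on which the tables differ, or None."""
    for sym in sorted(t1):
        if t1[sym] != t2[sym]:
            return sym
    return None

MAP_BASE = {"A": "T", "T": "A", "G": "C", "C": "G"}
FILTER_AC = {"A": "A", "T": "", "G": "", "C": "C"}

M_F_DNA = compose(MAP_BASE, FILTER_AC)   # filter_AC (map_base l)
F_M_DNA = compose(FILTER_AC, MAP_BASE)   # map_base (filter_AC l)

w = differentiating_input(M_F_DNA, F_M_DNA)
print(w, repr(M_F_DNA[w]), repr(F_M_DNA[w]))  # A '' 'T'
```

So the two programs are not equivalent: on the input A, m_f_DNA produces the empty list while f_m_DNA produces T.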

22 BEK: analysis of string sanitizers. P. Hooimeijer, M. Veanes, B. Livshits, D. Molnar, P. Saxena [USENIX11, POPL12]

25 QUESTION: What could possibly go wrong?

26 26 Attacker: gollum.png' onload='javascript:...

27 27 Attacker: gollum.png' onload='javascript:... Result: <img src='gollum.png' onload='javascript:…

28 28 Attacker: im.png' onload='javascript:... Result: <img src='im.png' onload='javascri I found my PRECIOUSS S.

30 FIRST LINE OF DEFENSE: SANITIZERS Sanitizer: a string transformation function. PLDI'12 submission presentations 30 “im.png' …”“img.png' …” Sanitized dataUntrusted data Dec 8, 2011

31 COMPARING SANITIZERS 31

32 32 ' ' single quote html entity

33 33 some untrusted input

34 34 Library A Name: Around for: Availability: HtmlEncode Years Readily available to C# developers some untrusted input

35 35 Library A Name: Around for: Availability: Library B Name: Around for: Availability: HtmlEncode Years Readily available to C# developers HtmlEncode Years Readily available to C# developers some untrusted input

36 36 Library A Name: Around for: Availability: Library B Name: Around for: Availability: HtmlEncode Years Readily available to C# developers HtmlEncode Years Readily available to C# developers ' ' ' ' ✔ ✘

37 .NET WebUtility: public static string HtmlEncode(string s) { if (s == null) return null; int num = IndexOfHtmlEncodingChars(s, 0); if (num == -1) return s; StringBuilder builder=new StringBuilder(s.Length+5); int length = s.Length; int startIndex = 0; Label_002A: if (num > startIndex) { builder.Append(s, startIndex, num-startIndex); } char ch = s[num]; if (ch > '>') { builder.Append("&#"); builder.Append(((int) ch).ToString(NumberFormatInfo.InvariantInfo)); builder.Append(';'); } else { char ch2 = ch; if (ch2 != '"') { switch (ch2) { case '<': builder.Append("&lt;"); goto Label_00D5; case '=': goto Label_00D5; case '>': builder.Append("&gt;"); goto Label_00D5; case '&': builder.Append("&amp;"); goto Label_00D5; } } else { builder.Append("&quot;"); } } Label_00D5: startIndex = num + 1; if (startIndex < length) { num = IndexOfHtmlEncodingChars(s, startIndex); if (num != -1) { goto Label_002A; } builder.Append(s, startIndex, length-startIndex); } return builder.ToString(); } MS AntiXSS: private static string HtmlEncode(string input, bool useNamedEntities, MethodSpecificEncoder encoderTweak) { if (string.IsNullOrEmpty(input)) { return input; } if (characterValues == null) { InitialiseSafeList(); } if (useNamedEntities && namedEntities == null) { InitialiseNamedEntityList(); } // Setup a new character array for output. 
char[] inputAsArray = input.ToCharArray(); int outputLength = 0; int inputLength = inputAsArray.Length; char[] encodedInput = new char[inputLength * 10]; SyncLock.EnterReadLock(); try { for (int i = 0; i < inputLength; i++) { char currentCharacter = inputAsArray[i]; int currentCodePoint = inputAsArray[i]; char[] tweekedValue; // Check for invalid values if (currentCodePoint == 0xFFFE || currentCodePoint == 0xFFFF) { throw new InvalidUnicodeValueException(currentCodePoint); } else if (char.IsHighSurrogate(currentCharacter)) { if (i + 1 == inputLength) { throw new InvalidSurrogatePairException(currentCharacter, '\0'); } // Now peak ahead and check if the following character is a low surrogate. char nextCharacter = inputAsArray[i + 1]; char nextCodePoint = inputAsArray[i + 1]; if (!char.IsLowSurrogate(nextCharacter)) { throw new InvalidSurrogatePairException(currentCharacter, nextCharacter); } // Look-ahead was good, so skip. i++; // Calculate the combined code point long combinedCodePoint = 0x10000 + ((currentCodePoint - 0xD800) * 0x400) + (nextCodePoint - 0xDC00); char[] encodedCharacter = SafeList.HashThenValueGenerator(combinedCodePoint); encodedInput[outputLength++] = '&'; for (int j = 0; j < encodedCharacter.Length; j++) { encodedInput[outputLength++] = encodedCharacter[j]; } encodedInput[outputLength++] = ';'; } else if (char.IsLowSurrogate(currentCharacter)) { throw new InvalidSurrogatePairException('\0', currentCharacter); } else if (encoderTweak != null && encoderTweak(currentCharacter, out tweekedValue)) { for (int j = 0; j < tweekedValue.Length; j++) { encodedInput[outputLength++] = tweekedValue[j]; } else if (useNamedEntities && namedEntities[currentCodePoint] != null) { char[] encodedCharacter = namedEntities[currentCodePoint]; encodedInput[outputLength++] = '&'; for (int j = 0; j < encodedCharacter.Length; j++) { encodedInput[outputLength++] = encodedCharacter[j]; } encodedInput[outputLength++] = ';'; } else if (characterValues[currentCodePoint] != 
null) { // character needs to be encoded char[] encodedCharacter = characterValues[currentCodePoint]; encodedInput[outputLength++] = '&'; for (int j = 0; j < encodedCharacter.Length; j++) { encodedInput[outputLength++] = encodedCharacter[j]; } encodedInput[outputLength++] = ';'; } else { // character does not need encoding encodedInput[outputLength++] = currentCharacter; } finally { SyncLock.ExitReadLock(); } return new string(encodedInput, 0, outputLength); }

38 .NET WebUtility vs. MS AntiXSS (same two HtmlEncode implementations as on the previous slide): Same behavior on all inputs? If not, what is a differentiating input? Can it generate any known ‘bad’ outputs?

39 39 PHP Trunk Changes to html.c, 1999--2011

40 40 PHP Trunk Changes to html.c, 1999—2011 R7,841 April 1999 135 loc R309,482 March 2011 1693 loc

41 41 PHP Trunk Changes to html.c, 1999—2011 R32,564 September 2000 ENT_QUOTES introduced R7,841 April 1999 135 loc R309,482 March 2011 1693 loc

42 42 PHP Trunk Changes to html.c, 1999—2011 R32,564 September 2000 ENT_QUOTES introduced R242,949 September 2007 $double_encode=true R7,841 April 1999 135 loc R309,482 March 2011 1693 loc

43 43 PHP Trunk Changes to html.c, 1999—2011 Safe to apply twice? Safe to combine with other sanitizers?

44 MOTIVATION 44 Writing string sanitizers correctly is difficult There is no cheap way to identify problems with sanitizers ‘Correctness’ is a moving target What if we could say more about sanitizer behavior?

45 CONTRIBUTIONS BEK. Frontend: a small language for string manipulation, similar to how sanitizers are written today. Backend: a model based on symbolic finite transducers, with algorithms for analysis and code generation.

46 CONTRIBUTIONS Evaluation. Converted sanitizers from a variety of sources. Checked properties like reversibility, idempotence, equivalence, and commutativity.

47 47 s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program BEK ARCHITECTURE

48 48 Symbolic Finite Transducers Z3 Transformation Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program BEK ARCHITECTURE

49 49 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program BEK ARCHITECTURE

50 BEK ARCHITECTURE
Bek program:
s := iter(c in t)[b := false;] {
  case (!b && c in "[\"\\]"): b := false; yield('\\', c);
  case (c == '\\'): b := !b; yield(c);
  case (true): b := false; yield(c);
};
Pipeline: Bek program → Transformation → Symbolic Finite Transducers (Microsoft.Automata, Z3) → Analysis (Does it do the right thing? Counterexample: “\' vs. \\'”) and Code Gen (C#, JavaScript, C)

52-54 A BEK PROGRAM: ESCAPE QUOTES
escape := iter(c in s)[b := false;] {
  case (!b && c in "['\"]"): b := false; yield('\\', c);
  case (c == '\\'): b := !b; yield(c);
  case (true): b := false; yield(c);
};
Simple dedicated syntax: iterate over the characters in string s while updating one boolean variable b
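To make the semantics of the iterator concrete, here is a direct Python transcription of the escape program. It is a sketch of how the three cases read, assuming the character class "['\"]" means "single or double quote"; it is not generated by BEK.

```python
# Sketch: Python transcription of the Bek `escape` program (assumed semantics).

def escape(s):
    out = []
    b = False  # True iff an odd number of backslashes immediately precedes c
    for c in s:
        if not b and c in "'\"":
            # unescaped quote: emit a backslash in front of it
            b = False
            out.append("\\" + c)
        elif c == "\\":
            # backslash: flip the flag and copy it
            b = not b
            out.append(c)
        else:
            # any other character (or an already escaped quote): copy it
            b = False
            out.append(c)
    return "".join(out)

once = escape("it's")
print(once)                  # it\'s
print(escape(once) == once)  # True: already escaped quotes are left alone
```

The second call illustrates the kind of property (idempotence) that the talk proposes to check automatically instead of by testing.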

55 55 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE

56 FINITE STATE TRANSDUCERS One transition per character: a/A, b/B, …, z/Z, …, &/&. Problem: the alphabet has 2^16 characters. TOO MANY TRANSITIONS

57 SYMBOLIC FINITE TRANSDUCERS Only two transitions!! x in [a-z] / x-32; x not in [a-z] / x
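The two symbolic transitions can be sketched as (guard, outputs) pairs, where the guard is a predicate over the character code and each output is a function of it. The representation below is an assumption for illustration; the actual tool represents guards as Z3 formulas rather than Python lambdas.

```python
# Sketch: a one-state symbolic finite transducer as (guard, output functions) pairs.
# Guards and outputs are plain Python callables here (an illustrative assumption).

def run_sft(transitions, word):
    out = []
    for ch in word:
        x = ord(ch)
        for guard, outputs in transitions:
            if guard(x):
                out.extend(chr(f(x)) for f in outputs)
                break
        else:
            raise ValueError("no transition enabled for %r" % ch)
    return "".join(out)

def is_lower(x):
    return ord("a") <= x <= ord("z")

TO_UPPER = [
    (is_lower, [lambda x: x - 32]),               # x in [a-z] / x-32
    (lambda x: not is_lower(x), [lambda x: x]),   # x not in [a-z] / x
]

print(run_sft(TO_UPPER, "bek & z3"))  # BEK & Z3
```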

58 SYMBOLIC FINITE TRANSDUCERS Example transitions (guard / output sequence): x>5 / [x+1, x]; x%2=1 / [x-1, x, x+4]; true / [5]; true / [x-4]. Guards are predicates; outputs are sequences of functions. Alphabet theory has to be DECIDABLE: we’ll use Z3 to check predicate satisfiability

59 59 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE

60 60 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen Now what? BEK ARCHITECTURE

61 SFT Algorithms 61 EQUIVALENCE CHECKING IS DECIDABLE! Alphabet theory has to be DECIDABLE We’ll use Z3 to check predicate satisfiability

62 SFT Algorithms 62 AntiXSS.HtmlEncode = WebUtility.HtmlEncode EQUIVALENCE CHECKING

63 CLOSED UNDER COMPOSITION SFT A: in → out; SFT B: in → out; composed SFT A ∘ B: in → out

64 SFT Algorithms: COMPOSITION SFT A ∘ B is again an SFT. Example check: JavaScriptEncode(HtmlEncode(w)) = HtmlEncode(JavaScriptEncode(w))

65 65 PRE-IMAGE COMPUTATION Regular Language O Regular Language I outin SFT A

66 66 PRE-IMAGE COMPUTATION MALICIOUS INPUTS Vulnerability signature outin SFT A

68 QUESTIONS Can BEK model existing sanitizers? Can we use it to check interesting properties on real sanitizers?

69 Language Features 69 Data: 1x OWASP HTMLencode 13x Google AutoEscape 21x IE 8 XSS Filter 7x Synthetic inspect feature counts WHAT FEATURES ARE NEEDED?

70 Language Features 70 Majority (76%) of sanitizers can be ported without extending the language With multi-character lookahead: 90% WHAT FEATURES ARE NEEDED?

71 71 Data 4x MS internal HtmlEncode 3x ‘for hire’ HtmlEncode based on English- language specification (C#) Commutative? Equivalent? CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?

72 72 Short answer: Yes! CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS?

73 CAN WE CHECK INTERESTING PROPERTIES ON REAL SANITIZERS? Short answer: Yes! EQ results take less than a minute to obtain:
     1  2  3  4  5  6  7
1    ✔  ✔  ✔  ✘  ✘  ✔  ✘
2       ✔  ✔  ✘  ✘  ✔  ✘
3          ✔  ✘  ✘  ✔  ✘
4             ✔  ✘  ✘  ✘
5                ✔  ✘  ✘
6                   ✔  ✘
7                      ✔

74 DOES IT SCALE? [Plots: Commutativity; Self-Equivalence]

75 The Cheat Sheet 75 One out of seven implementations correctly encodes all strings for use in both HTML and attribute contexts WERE ALL SANITIZERS BROKEN?

76 BEK IN A NUTSHELL BEK is a domain-specific language for writing string sanitizers. BEK can model programs without approximation using symbolic finite transducers, enabling, e.g., equivalence checks. BEK was evaluated using real-world sanitizers from a variety of different sources.

78 BEX: ANALYSIS OF STRING ENCODERS. Loris D’Antoni, Margus Veanes [VMCAI13, CAV13]

79 79 Hi, I’m plain text! Nice to meet you! SGkgSSdtIHBsYWluI HRleHQsIG5pY2Ugd G8gbWVldCB5b3Uh Encoder Decoder

80 NOT SO EASY TO GET RIGHT 80

81 WHEN ARE THEY CORRECT? 81 T Encoder T’ Decoder T Encoder T’T

82 CAN WE USE TRANSDUCERS? 82 T Encoder T’ Decoder T Encoder o Decoder = Identity

83 Language Features 83 Majority (76%) of sanitizers can be ported without extending Bek With multi-character lookahead: 90% BEK: WHAT FEATURES WERE NEEDED?

84 BASE64 encoder: 3 bytes → 4 Base64 characters
Text content:    M         a         n
Bytes:           77        97        110
Bit pattern:     01001101  01100001  01101110
Index:           19  22  5  46
Base64 encoded:  T   W   F  u
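The table can be reproduced with a few bit operations: three bytes are regrouped into four 6-bit indices, and each index selects one character of the Base64 alphabet. The function names below are hypothetical, and padding for inputs whose length is not a multiple of three is deliberately left out of this sketch.

```python
# Sketch: Base64 encoding of whole 3-byte blocks (no padding handled).
import base64
import string

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def block_indices(b1, b2, b3):
    """Regroup 3 bytes (24 bits) into four 6-bit indices."""
    return [
        b1 >> 2,
        ((b1 & 3) << 4) | (b2 >> 4),
        ((b2 & 0xF) << 2) | (b3 >> 6),
        b3 & 0x3F,
    ]

def encode_blocks(data):
    if len(data) % 3 != 0:
        raise ValueError("padding is not modeled in this sketch")
    out = []
    for i in range(0, len(data), 3):
        out.extend(ALPHABET[k] for k in block_indices(*data[i:i + 3]))
    return "".join(out)

print(block_indices(77, 97, 110))  # [19, 22, 5, 46]
print(encode_blocks(b"Man"))       # TWFu
```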

85 85 HOW DO WE EXTEND BEK?

86 86 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE Symbolic finite transducers don’t have registers 

87 TRANSDUCERS WITH REGISTERS States 0, 1, 2 with register r. From 0 to 1: x / [ x>>2 ], r := (x&3)<<4. From 1 to 2: x / [ r|(x>>4) ], r := (x&0xF)<<2. From 2 to 0: x / [ r|(x>>6), x&0x3F ], r := 0. Transducers with registers are closed under composition. Equivalent to Turing Machines
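The three register-updating transitions can be executed directly as a small state machine. In the sketch below the assignment of transitions to states 0, 1 and 2 is inferred from the Base64 bit layout, and the function name is hypothetical; the output is the list of 6-bit indices rather than characters.

```python
# Sketch: the Base64 transducer with one register r (state numbering assumed).

def base64_indices(data):
    state, r, out = 0, 0, []
    for x in data:
        if state == 0:
            out.append(x >> 2)          # x / [x>>2]
            r = (x & 3) << 4            # r := (x&3)<<4
            state = 1
        elif state == 1:
            out.append(r | (x >> 4))    # x / [r|(x>>4)]
            r = (x & 0xF) << 2          # r := (x&0xF)<<2
            state = 2
        else:
            out.extend([r | (x >> 6), x & 0x3F])  # x / [r|(x>>6), x&0x3F]
            r = 0                                 # r := 0
            state = 0
    if state != 0:
        raise ValueError("input length must be a multiple of 3 in this sketch")
    return out

print(base64_indices(b"Man"))  # [19, 22, 5, 46]
```

The register only ever holds a few bits of the previous byte, which is what makes the later "explore register values" step finite.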

88 EXPLORE REGISTER VALUES The register has finitely many values, so the last value can be remembered in the state: 2^|bits| states

89 89 BASE64 IN BEX DEMO

91 91 Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bek Program Code Gen C#JavaScriptC Code Gen BEK ARCHITECTURE

92 92 ? Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bex Program Code Gen C#JavaScriptC Code Gen BEX ARCHITECTURE

93 EXTENDED SYMBOLIC FINITE TRANSDUCERS Input: Man… A transition from p to q reads 3 characters x1, x2, x3 at once: x1 ≤ 0xFF ∧ x2 ≤ 0xFF ∧ x3 ≤ 0xFF / [ x1>>2, ((x1&3)<<4)|(x2>>4), ((x2&0xF)<<2)|(x3>>6), x3&0x3F ]

94 EXTENDED SYMBOLIC FINITE TRANSDUCERS Applied to the input Man…, this lookahead-3 transition produces the output TWFu…

95 MORE EXPRESSIVE THAN SYMBOLIC FINITE TRANSDUCERS Example transition between states 0 and 1 that reads two characters: x1 > x2 / [ x1+x2 ]. Do they still have nice properties?

96 WHAT DO WE NEED? 96 T Encoder T’ Decoder T Encoder o Decoder = Identity CompositionEquivalence

97 NEGATIVE RESULTS 97 ESFAs: – equivalence is undecidable – are not closed under intersection – are not closed under complement ESFTs – equivalence is undecidable – are not closed under composition

98 A FRIENDLIER RESTRICTION 98

99 CARTESIAN EXTENDED SYMBOLIC FINITE TRANSDUCERS The negative results use binary predicates, and encoders do not use this feature. Only allow conjunctions of unary predicates. Not Cartesian: x1 < x2+1. Cartesian: x1 > 5 ∧ x2 = 1 / [ x1+x2, x1 ]

100 CARTESIAN ESFA = SFA Cartesian ESFAs are now equivalent to SFAs: the lookahead-2 guard x1 > 5 ∧ x2 = 1 becomes two single-character transitions, x > 5 followed by x = 1

101 STILL MORE EXPRESSIVE THAN SFTS Cartesian ESFTs are strictly more expressive than SFTs!! Example: x1 > 5 ∧ x2 = 1 / [ x1+x2 ] (equivalent SFT: ?)

102 WHAT DO WE NEED? 102 T Encoder T’ Decoder T Encoder o Decoder = Identity CompositionEquivalence

103 RESULTS 103 Cartesian ESFTs – equivalence is decidable – are not closed under composition

104 COMPOSITION IN PRACTICE 104

105 105 BEK WITH REGISTERS?

106 TRANSDUCERS WITH REGISTERS States 0, 1, 2 with register r. From 0 to 1: x / [ x>>2 ], r := (x&3)<<4. From 1 to 2: x / [ r|(x>>4) ], r := (x&0xF)<<2. From 2 to 0: x / [ r|(x>>6), x&0x3F ], r := 0. Transducers with registers are closed under composition. Equivalent to Turing Machines

107 COMPOSING CARTESIAN ESFTS Cartesian ESFTs A and B are translated into transducers with registers A’ and B’, which are composed into A’ o B’. Can A’ o B’ be turned back into a Cartesian ESFT for A o B?

108 REGISTER ELIMINATION Transducer with register, two consecutive transitions: x / [ x+4 ], r := (x-2), followed by x / [ r+x, x+1 ], r := 0. Equivalent ESFT transition with lookahead 2: [x1, x2] / [ x1+4, x1-2+x2, x2+1 ]
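The effect of register elimination can be checked on concrete inputs: substituting the stored value x1-2 for r in the second transition yields the lookahead-2 rule. The sketch below assumes the two register transitions form a two-step loop, which is one reading of the diagram; both function names are hypothetical.

```python
# Sketch: a two-step register transducer and its register-free ESFT counterpart.

def with_register(xs):
    state, r, out = 0, 0, []
    for x in xs:
        if state == 0:
            out.append(x + 4)            # x / [x+4]
            r = x - 2                    # r := x-2
            state = 1
        else:
            out.extend([r + x, x + 1])   # x / [r+x, x+1]
            r = 0                        # r := 0
            state = 0
    if state != 0:
        raise ValueError("undefined on odd-length input in this sketch")
    return out

def with_lookahead(xs):
    if len(xs) % 2 != 0:
        raise ValueError("undefined on odd-length input in this sketch")
    out = []
    for x1, x2 in zip(xs[0::2], xs[1::2]):
        out.extend([x1 + 4, x1 - 2 + x2, x2 + 1])  # [x1,x2] / [x1+4, x1-2+x2, x2+1]
    return out

sample = [1, 2, 3, 4]
print(with_register(sample))   # [5, 1, 3, 7, 5, 5]
print(with_lookahead(sample))  # [5, 1, 3, 7, 5, 5]
```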

109 DOES IT WORK? 109

110 UNICODE UTF8 to UTF16 encoder (E) and decoder (D)
Test                  Running time
Dom(E) = UTF16        47 ms
Dom(EoD) = UTF16      109 ms
Dom(D) = UTF8         156 ms
Dom(DoE) = UTF8       320 ms
EoD = Identity        16 ms
DoE = Identity        24 ms
Complete analysis in about a second

111 BASE64 Base64 encoder (E) and decoder (D)
Test                  Running time
Dom(E) = bytes        13 ms
Dom(EoD) = bytes      55 ms
Dom(D) = 6bits+       76 ms
Dom(DoE) = 6bits+     56 ms
EoD = Identity        53 ms
DoE = Identity        19 ms

112 112 Cartesian Extended Symbolic Finite Transducers Z3 Transformation Analysis Does it do the right thing? EoD=I Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Bex Program Code Gen C#JavaScriptC Code Gen BEX ARCHITECTURE

113 BEX IN A NUTSHELL BEX is a domain-specific language for writing string encoders. BEX can model programs without approximation using Cartesian extended symbolic finite transducers. BEX was evaluated using real-world string encoders.

115 FAST: ANALYSIS OF PROGRAMS MANIPULATING TREES. Loris D’Antoni, Margus Veanes, Ben Livshits, David Molnar [PLDI14]

117 SOLUTION: USE AN HTML SANITIZER Remove malicious active code from HTML documents SANITIZE 117 alert(“This is Sparta!”); I swear this HTML is safe! I swear this HTML is safe!

118 TYPICAL TRANSFORMATIONS Remove scripts Remove malicious URLs Replace deprecated tags Given a sanitizer S: Does S always produce a safe and well-formed output? Is S defined on every possible HTML file? Does executing S twice produce the same output as executing S once? Can we execute S fast? 118 Typical transformations Interesting questions

119 HOW DO WE WRITE ONE? 119 DEMODEMO: http://rise4fun.com/Fast/2 1

125 KEY IDEA: HTML CODE IS A TREE body script malicious code div p I swear this HTML is safe! 125 SANITIZE body div p I swear this HTML is safe!

126 MOTIVATION Trees are common input/output data structures – XML query, type-checking, etc… – Compilers/optimizers (from parse tree to parse tree) – Tree manipulating programs: data structures algorithms, ontologies, etc… 126

127 127 ? Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Fast Program Code Gen C#JavaScriptC Code Gen FAST ARCHITECTURE

128 CHOOSING THE RIGHT FORMALISM 128

129 SEMANTICS AS TRANSDUCERS Goal: find a class of tree transducers that can express the previous examples and is closed under composition 129

130 TOP DOWN TREE TRANSDUCERS [ENGELFRIET75] Example rule: q(a(x1, x2)) → b(c, q1(x1)). Decidable properties: type-checking, etc… Domain expressiveness: only finite alphabets

131 SYMBOLIC TREE TRANSDUCERS [PSI11] Example rule: q(λa.a>3, (x1, x2)) → λa.a+1, (λa.a-2, q1(x1)). For a node labeled 5 the guard 5>3 is true, so the output node is labeled 5+1 and has children labeled 5-2 and q1(x1). Decidable properties: type-checking, etc… Domain expressiveness: infinite alphabets using predicates and functions. Structural expressiveness: can’t delete a node without reading it first. Alphabet theory has to be DECIDABLE: we’ll use Z3 to check predicate satisfiability

132 IMPROVING STRUCTURAL EXPRESSIVENESS Transformation: delete the left child if it contains a script If we delete the node we can’t check that the left child contained a script divq q 132 Regular Look-Ahead (RLA) ??

133 REGULAR LOOK AHEAD : Transformation: delete the left child if it contains a script Rules can ask whether the children are in particular languages – p 1 : the language of trees that contain a script node – p 2 : the language of all trees Decidable properties: type-checking, etc… Domain expressiveness: infinite alphabets Structural expressiveness: good enough to express our examples div q p1p1 p2p2 q Transformation now is safe 133
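The look-ahead rule can be sketched on trees encoded as (label, children) pairs. Here `contains_script` plays the role of the look-ahead language p1 (trees that contain a script node), and p2 (all trees) needs no check. The encoding and function names are assumptions for illustration, not FAST syntax.

```python
# Sketch: "delete the left child if it contains a script" with a look-ahead test.
# Trees are (label, [children]) pairs (an illustrative assumption).

def contains_script(tree):
    """Look-ahead language p1: trees that contain a script node."""
    label, children = tree
    return label == "script" or any(contains_script(c) for c in children)

def sanitize(tree):
    label, children = tree
    if children and contains_script(children[0]):
        # the look-ahead inspects the left child before the rule drops it
        children = children[1:]
    return (label, [sanitize(c) for c in children])

page = ("body", [("script", []), ("div", [("p", [])])])
print(sanitize(page))  # ('body', [('div', [('p', [])])])
```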

134
Model                                                               Decidability  Complexity  Structural expressiveness  Infinite alphabets
Top Down Tree Transducers [Engelfriet75]                            V             V           X                          X
Top Down Tree Transducers with Regular Look-ahead [Engelfriet76]    V             V           ~                          X
Streaming Tree Transducers [AlurDantoni12]                          V             X           V                          X
Data Automata [Bojanczyk98]                                         ~             X           X                          V
Symbolic Tree Transducers [VeanesBjoerner11]                        V             V           X                          V
Symbolic Tree Transducers RLA                                       V             V           ~                          V

135 COMPOSITION OF STTR Given T1 and T2, build a single transducer T1 o T2. This is not always possible!! Find the biggest class for which it is possible

136 WHEN CAN WE COMPOSE? Theorem: T(x) = T2(T1(x)) is definable by a Symbolic Tree Transducer with RLA if T1 is deterministic. All our examples fall in this category. Alphabet theory has to be DECIDABLE: we’ll use Z3 to check predicate satisfiability

137 137 Symbolic Tree Transducers with RLA Z3 Transformation Analysis Does it do the right thing? Counterexample “\' vs. \\'” Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; Fast Program Code Gen C#JavaScriptC Code Gen FAST ARCHITECTURE

138 CASE STUDIES AND EXPERIMENTS 138

139 CASE STUDIES AND EXPERIMENTS Program Optimization: Deforestation of functional programs Verification: HTML sanitization Analysis of functional programs Augmented reality app store 139 Infinite Alphabets: Integer Data types

140 DEFORESTATION Removing intermediate data structures from programs
alphabet IList [i : int] { nil(0), cons(1) }
trans mapC: IList -> IList {
  nil() to nil [0]
| cons(x) to cons [(i+5)%26] (mapC x)
}
def mapC2 : IList -> IList := compose mapC mapC
ADVANTAGE: the program is a single transducer that reads the input list only once, thanks to transducer composition
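The benefit of composing mapC with itself can be illustrated with ordinary lists: the unfused version builds an intermediate list, while the fused version applies both steps per element in one pass. The Python names below are hypothetical stand-ins for mapC and mapC2, and the fusion is done by hand rather than by transducer composition.

```python
# Sketch: deforestation of mapC composed with itself (hand-fused, names assumed).

def step(i):
    return (i + 5) % 26

def map_c(xs):
    """One traversal, producing a new list."""
    return [step(i) for i in xs]

def map_c2_naive(xs):
    """Two traversals with an intermediate list."""
    return map_c(map_c(xs))

def map_c2_fused(xs):
    """One traversal: the composed per-element function, no intermediate list."""
    return [step(step(i)) for i in xs]

xs = [0, 20, 25]
print(map_c2_naive(xs))  # [10, 4, 9]
print(map_c2_fused(xs))  # [10, 4, 9]
```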

141 STAGES BY EXAMPLE 141 mapCmapC2 Transducers

142 DEFORESTATION: SPEEDUP 142 f(f(f(…f(x)...) (f;f;f;…;f)(x)

143 ANALYSIS OF FUNCTIONAL PROGRAMS 143

144 AR INTERFERENCE ANALYSIS Recognizers output data that can be seen as a tree structure Spine Hip Neck HeadKnee Ankle Foot …. 144

145 APPS AS TREE TRANSFORMATIONS Applications that use recognizers can be modeled as FAST programs
trans addHat: STree -> STree
  Spine(x,y) to Spine(addHat(x), y)
| Neck(h,l,r) to Neck(addHat(h), l, r)
| Head(a) to Head(Hat(a))
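The addHat transformation reads naturally as a recursive function on a skeleton tree. The sketch below encodes nodes as tuples whose first element is the constructor name; the `Leaf` and `Hip` placeholder nodes are hypothetical, and inputs that match none of the three rules are treated as undefined.

```python
# Sketch: addHat as a recursive tree function (tuple encoding assumed).

def add_hat(tree):
    tag = tree[0]
    if tag == "Spine":
        _, x, y = tree
        return ("Spine", add_hat(x), y)
    if tag == "Neck":
        _, h, l, r = tree
        return ("Neck", add_hat(h), l, r)
    if tag == "Head":
        _, a = tree
        return ("Head", ("Hat", a))
    raise ValueError("addHat is undefined on %r nodes" % tag)

leaf = ("Leaf",)  # hypothetical placeholder subtree
skeleton = ("Spine", ("Neck", ("Head", leaf), leaf, leaf), ("Hip",))
print(add_hat(skeleton))
```

Only the path Spine, Neck, Head is rewritten; every other subtree is copied unchanged, which is what makes it possible to ask statically which nodes an app annotates.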

146 COMPOSITION OF PROGRAMS Two FAST programs can be composed into a single FAST program p1p1p1p1 p2p2p2p2 p 1 ;p 2 146

147 ANOTHER RECOGNIZER 147 Room Floor Wall Table Chair …. Chair ….

148 INTERFERENCE ANALYSIS Apps can be malicious: they may try to overwrite the outputs of other apps. Apps interfere when they annotate the same node of a recognizer’s output. We can compose them and check statically whether they interfere!! Put the checker in the App Store and analyze apps before approval. Interfering apps: Add cat ears / Add hat; Add pin to a city / Blur a city; Amazon Buy Now button / Malicious Buy Now button

149 INTERFERENCE ANALYSIS IN PRACTICE 100 generated FAST programs, up to 85 functions each. Check statically whether they conflict pairwise for ANY possible input. Checked 99% of program pairs in less than 0.5 sec! For an App Store these running times are perfectly fine

150 TWO PENDING PATENTS 150

151 FAST IN A NUTSHELL FAST is a domain-specific language for writing tree manipulating programs. FAST can model programs without approximation using symbolic tree transducers with regular lookahead. FAST was evaluated using real-world programs.

153 WHAT’S NEXT 153

154 FOR EACH DOMAIN-SPECIFIC TASK Design a language that: has only the features required by the task; is simple to use; enables automatic reasoning about what its programs do; compiles into efficient code.

155 DREX EFFICIENT STRING MANIPULATION Loris D’Antoni Mukund Raghothaman Here at POPL15! Rajeev Alur

156 DECLARATIVE LANGUAGE FOR STRING SCRIPTS (15/1, 2PM, SEC. 2B) 156 a b a b b/b (a|b)*b iterate(choice(a->a, b->b)) a/a Execute this code in linear time left- to-right pass on the input string!!

157 BEX 2.0 PARALLEL EXECUTION OF STRING ENCODERS Margus Veanes Here at POPL 15!! David MolnarBen Livshits Todd Mytkowicz

158 FROM TRANSDUCERS TO PARALLEL EXECUTIONS (15/1, 2PM, SEC. 2B) Efficient data-parallel code 158 12 x / [ r+x, x+1], r := 0 x / [ x+4 ], r := (x-2) 02

159 PROGRAM BOOSTING OR CROWD-SOURCING FOR CORRECTNESS Here at POPL 15!! Loris D’Antoni David Molnar Benjamin Livshits Margus Veanes Robert Cochran

160 CROWD-SOURCING PROGRAMS WITH AUTOMATA (17/1, 4PM, SEC. 9B) 160 Specification

161 YOU CAN HELP TOO! 161

162 INTERESTING DIRECTIONS A transducer-based language for: web scrapers; spreadsheet transformations; compiler optimizations; XML processing; HTML rendering

163 SUMMARIZING… 163

164 164 Transducer Model Z3 Transformation Analysis Does it do the right thing? Analysis question Microsoft.Automata s := iter(c in t)[b := false;] { case (!b && c in "[\"\\]"): b := false; yield('\\', c); case (c == '\\'): b := !b; yield(c); case (true): b := false; yield(c); }; DSL Code Gen C#JavaScriptC Code Gen OUR RECIPE FOR EACH TASK

165
BEK
Hooimeijer, Livshits, Molnar, Saxena, Veanes. Fast and precise sanitizer analysis with BEK. USENIX11
Veanes, Hooimeijer, Livshits, Molnar, Bjorner. Symbolic finite state transducers: algorithms and applications. POPL12
BEX
D’Antoni, Veanes. Static analysis of string encoders and decoders. VMCAI13
D’Antoni, Veanes. Equivalence of extended symbolic finite transducers. CAV13
Veanes, Mytkowicz, Molnar, Livshits. Data parallel string manipulating programs. POPL15
FAST
D’Antoni, Veanes, Livshits, Molnar. Fast: a transducer based language for tree manipulation. PLDI14

